

Transfer Learning in Deep Reinforcement Learning: A Survey

Zhuangdi Zhu, Kaixiang Lin, and Jiayu Zhou

Abstract—Reinforcement Learning (RL) is a key technique to address sequential decision-making problems and is crucial to realize advanced artificial intelligence. Recent years have witnessed remarkable progress in RL by virtue of the fast development of deep neural networks. Along with the promising prospects of RL in numerous domains, such as robotics and game-playing, transfer learning has arisen as an important technique to tackle various challenges faced by RL, by transferring knowledge from external expertise to accelerate the learning process. In this survey, we systematically investigate the recent progress of transfer learning approaches in the context of deep reinforcement learning. Specifically, we provide a framework for categorizing the state-of-the-art transfer learning approaches, under which we analyze their goals, methodologies, compatible RL backbones, and practical applications. We also draw connections between transfer learning and other relevant topics from the RL perspective and explore their potential challenges as well as open questions that await future research progress.

Index Terms—Transfer Learning, Reinforcement Learning, Deep Learning, Survey.


1 INTRODUCTION

Reinforcement Learning (RL) is an effective framework to solve sequential decision-making tasks, where a learning agent interacts with the environment to improve its performance through trial and error [1]. Originating from cybernetics and thriving in computer science, RL has been widely applied to tackle challenging tasks which were previously intractable [2, 3].

As a pioneering technique for realizing advanced artificial intelligence, traditional RL was mostly designed for tabular cases, which provided principled solutions to simple tasks but faced difficulties when handling highly complex domains, e.g. tasks with 3D environments. Over recent years, an integrated framework, where an RL agent is built upon deep neural networks, has been developed to address more challenging tasks. The combination of deep learning with RL is hence referred to as Deep Reinforcement Learning (DRL) [4], which aims to address complex domains that were otherwise unresolvable by building deep, powerful function approximators. DRL has achieved notable success in applications such as robotics control [5, 6] and game playing [7]. It also has promising prospects in domains such as health informatics [8], electricity networks [9], and intelligent transportation systems [10, 11], to name just a few.

Besides its remarkable advancement, RL still faces intriguing difficulties induced by the exploration-exploitation dilemma [1]. Specifically, for practical RL, the environment dynamics are usually unknown, and the agent cannot exploit its knowledge about the environment to improve its performance until enough interaction experiences are collected via exploration. Due to partial observability, sparse feedback, and high-dimensional state and action spaces, acquiring sufficient interaction samples can be prohibitive, which may even incur safety concerns for domains such as automatic driving and health informatics, where the consequences of wrong decisions can be too costly to bear.

The abovementioned challenges have motivated various efforts to improve the current RL procedure. As a result, Transfer Learning (TL), which is a technique to utilize external expertise from other tasks to benefit the learning process of the target task, becomes a crucial topic in RL.

• Zhuangdi Zhu and Jiayu Zhou are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48823. E-mail: [email protected], [email protected]

• Kaixiang Lin is with the Amazon Alexa AI. E-mail: [email protected]

TL techniques have been extensively studied in the supervised learning domain [12], whereas they remain an emerging topic in RL. In fact, TL under the framework of RL can be more complicated in that the knowledge needs to be transferred in the context of a Markov Decision Process (MDP). Moreover, due to the delicate components of an MDP, the expert knowledge may take different forms, which need to be transferred in different ways. Noticing that previous efforts on summarizing TL for RL have not covered its most recent advancement [13, 14], in this survey we make a comprehensive investigation of Transfer Learning in Deep Reinforcement Learning. In particular, we build a systematic framework to categorize the state-of-the-art TL techniques into different sub-topics, review their theories and applications, and analyze their inter-connections.

The rest of this survey is organized as follows: In Section 2, we introduce the preliminaries of RL and its key algorithms, including those recently designed based on deep neural networks. Next, we clarify the definition of TL in the context of RL and discuss its relevant research topics (Section 2.4). In Section 3, we provide a framework to categorize TL approaches from multiple perspectives, analyze their fundamental differences, and summarize their evaluation metrics (Section 3.3). In Section 4, we elaborate on different TL approaches in the context of DRL, organized by the format of transferred knowledge, such as reward shaping (Section 4.1), learning from demonstrations (Section 4.2), or learning from teacher policies (Section 4.3). We also investigate TL approaches by the way that knowledge transfer occurs, such as inter-task mapping (Section 4.4) or learning transferrable representations (Section 4.5), etc. We discuss the recent applications of TL in the context of DRL in Section 5 and provide some future perspectives and open questions in Section 6.

2 DEEP REINFORCEMENT LEARNING AND TRANSFER LEARNING

In this section, we provide a brief overview of the recent development in RL and the definitions of some key terminologies. Next, we provide categorizations to organize different TL approaches, then point out some of the other topics in the context of RL, which are relevant to TL but will not be elaborated in this survey.

Remark 1. Without losing clarity, for the rest of this survey, we refer to MDPs, domains, and tasks equivalently.

2.1 Reinforcement Learning Preliminaries

A typical RL problem can be considered as training an agent to interact with an environment that follows a Markov Decision Process (MDP) [15]. For each interaction with the MDP, the agent starts with an initial state and performs an action accordingly, which yields a reward to guide the agent's actions. Once the action is taken, the MDP transits to the next state by following the underlying transition dynamics of the MDP. The agent accumulates the time-discounted rewards along with its interactions with the MDP. A subsequence of interactions is referred to as an episode. For MDPs with infinite horizons, one can assume that there are absorbing states, such that any action taken upon an absorbing state will only lead to itself and yields zero rewards. All abovementioned components in the MDP can be represented using a tuple M = (µ0, S, A, T, γ, R, S0), in which:

• µ0 is the set of initial states.
• S is the state space.
• A is the action space.
• T : S × A × S → R is the transition probability distribution, where T(s′|s, a) specifies the probability of the state transitioning to s′ upon taking action a from state s.
• R : S × A × S → R is the reward distribution, where R(s, a, s′) is the reward an agent can get by taking action a from state s with the next state being s′.
• γ is a discount factor, with γ ∈ (0, 1].
• S0 is the set of absorbing states.

An RL agent behaves in M by following its policy π, which is a mapping from states to actions: π : S → A. For stochastic policies, π(a|s) denotes the probability for the agent to take action a from state s. Given an MDP M and a policy π, one can derive a value function V^π_M(s), which is defined over the state space:

V^π_M(s) = E[r_0 + γ r_1 + γ^2 r_2 + . . . ; π, s],

where r_i = R(s_i, a_i, s_{i+1}) is the reward that an agent receives by taking action a_i in the i-th state s_i, with the next state transiting to s_{i+1}. The expectation E is taken over s_0 ∼ µ0, a_i ∼ π(·|s_i), s_{i+1} ∼ T(·|s_i, a_i). The value function estimates the quality of being in state s, by evaluating the expected rewards that an agent can get from s, given that the agent follows policy π in the environment M afterward. Similar to the value function, each policy also carries a Q-function, which is defined over the state-action space to estimate the quality of taking action a from state s:

Q^π_M(s, a) = E_{s′∼T(·|s,a)} [R(s, a, s′) + γ V^π_M(s′)].

The objective for an RL agent is to learn an optimal policy π*_M that maximizes the expectation of accumulated rewards, so that ∀s ∈ S, π*_M(s) = argmax_{a∈A} Q*_M(s, a), where Q*_M(s, a) = sup_π Q^π_M(s, a).
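To make these definitions concrete, the following minimal sketch (a hypothetical toy MDP, not an example from the survey) evaluates V^π and Q^π for a tabular stochastic policy by dynamic programming, assuming the transition and reward tensors are known:

import numpy as np

# Hypothetical 3-state, 2-action MDP: T[s, a, s'] and R[s, a, s'] are assumed known.
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # transition probabilities T(s'|s, a)
R = rng.normal(size=(n_s, n_a, n_s))                # rewards R(s, a, s')
pi = np.full((n_s, n_a), 1.0 / n_a)                 # uniform stochastic policy pi(a|s)

# Iterative policy evaluation:
# V(s) = sum_a pi(a|s) * sum_s' T(s'|s,a) [R(s,a,s') + gamma * V(s')]
V = np.zeros(n_s)
for _ in range(500):
    Q = (T * (R + gamma * V)).sum(axis=2)           # Q(s, a) = E_s'[R + gamma * V(s')]
    V = (pi * Q).sum(axis=1)

# Greedy improvement step: pi'(s) = argmax_a Q(s, a)
greedy_actions = Q.argmax(axis=1)
print(V, greedy_actions)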

2.2 Reinforcement Learning Algorithms

In this section, we review the key RL algorithms developed over the recent years, which provide cornerstones for the TL approaches discussed in this survey.

Prediction and Control: any RL problem can be disassembled into two subtasks: prediction and control [1]. In the prediction phase, the quality of the current policy is being evaluated. In the control phase, which is also referred to as the policy improvement phase, the learning policy is adjusted based on evaluation results from the prediction step. Policies can be improved by iteratively conducting these two steps, which is therefore called policy iteration.

Policy iteration can be model-free, which means that the target policy is optimized without requiring knowledge of the MDP transition dynamics. Traditional model-free RL includes Monte-Carlo methods, which use samples of episodes to estimate the value of each state based on complete episodes starting from that state. Monte-Carlo methods can be on-policy if the samples are collected by following the target policy, or off-policy if the episodic samples are collected by following a behavior policy that is different from the target policy.

Temporal-Difference Learning, or TD-learning for short, is an alternative to Monte-Carlo for solving the prediction problem. The key idea behind TD-learning is to learn the state quality function by bootstrapping. It can also be extended to solve the control problem, so that both the value function and the policy get improved simultaneously. TD-learning is one of the most widely used RL paradigms due to its simplicity and general applicability. Examples of on-policy TD-learning algorithms include SARSA [16], Expected SARSA [17], Actor-Critic [18], and its deep neural extension named A3C [19]. The off-policy TD-learning approaches include SAC [20] for continuous state-action spaces, and Q-learning [21] for discrete state-action spaces, along with its variants built on deep neural networks, such as DQN [22], Double-DQN [22], Rainbow [23], etc.
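As an illustration of the bootstrapping idea behind TD-learning, here is a minimal sketch of the tabular Q-learning update (off-policy TD control); the environment interface and its reset()/step() signature are assumptions for illustration only:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning: env is a hypothetical environment returning discrete states."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            a = np.random.randint(n_actions) if np.random.rand() < eps else Q[s].argmax()
            s_next, r, done = env.step(a)
            # TD-learning bootstraps on the current estimate of the next state's value
            td_target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q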

TD-learning approaches, such as Q-learning, focus more on estimating the state-action value functions. Policy Gradient, on the other hand, is a mechanism that emphasizes direct optimization of a parameterizable policy. Traditional policy-gradient approaches include REINFORCE [24]. Recent years have witnessed the joint presence of TD-learning and policy-gradient approaches, mostly ascribed to the rapid development of deep neural networks. Representative algorithms along this line include Trust Region Policy Optimization (TRPO) [25], Proximal Policy Optimization (PPO) [26], Deterministic Policy Gradient (DPG) [27] and its extensions, such as DDPG [28] and Twin Delayed DDPG [29].

2.3 Transfer Learning in the Context of Reinforcement Learning

Let M_s = {M_s | M_s ∈ M_s} be a set of source domains, which provides prior knowledge D_s that is accessible by the target domain M_t, such that by leveraging the information from D_s, the target agent learns better in the target domain M_t, compared with not utilizing it. We use M_s ∈ M_s to refer to a single source domain. For the simplest case, knowledge can transfer between two agents within the same domain, which results in |M_s| = 1 and M_s = M_t. We provide a more concrete description of TL from the RL perspective as the following:

Remark 2. [Transfer Learning in the Context of Reinforcement Learning] Given a set of source domains M_s = {M_s | M_s ∈ M_s} and a target domain M_t, Transfer Learning aims to learn an optimal policy π* for the target domain, by leveraging exterior information D_s from M_s as well as interior information D_t from M_t, s.t.:

π* = argmax_π E_{s∼µ^t_0, a∼π} [Q^π_M(s, a)],

where π = φ(D_s ∼ M_s, D_t ∼ M_t) : S_t → A_t is a function mapping from the states to actions for the target domain M_t, learned based on information from both D_t and D_s.

In the above definition, we use φ(D) to denote the learned policy based on information D. Especially in the context of DRL, the policy π is learned using deep neural networks. One can consider regular RL without transfer learning as a special case of the above definition by treating D_s = ∅, so that a policy π is learned purely on the feedback provided by the target domain, i.e. π = φ(D_t).
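As a schematic illustration of this definition (a sketch of the notation, not an algorithm from the survey), a TL method can be viewed as a function phi that consumes source knowledge D_s and target-domain interactions D_t and returns a target-domain policy; all type and function names below are hypothetical:

from typing import Any, Callable, List, Optional

SourceKnowledge = Any          # D_s: demonstrations, teacher policies, shaping potentials, ...
TargetInteractions = List      # D_t: transitions collected in the target domain M_t
Policy = Callable[[Any], Any]  # pi : S_t -> A_t

def phi(D_s: Optional[SourceKnowledge], D_t: TargetInteractions) -> Policy:
    """Sketch of a TL procedure: combine exterior knowledge D_s with interior data D_t.
    Each TL approach in Section 4 instantiates this differently (reward shaping, LfD, ...)."""
    raise NotImplementedError

def plain_rl(D_t: TargetInteractions) -> Policy:
    """Regular RL is the special case D_s = empty, i.e. pi = phi(D_t)."""
    return phi(None, D_t)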

2.4 Related Topics

In addition to TL, other efforts have been made to benefit RL by leveraging different forms of supervision, usually under different problem settings. In this section, we briefly discuss other techniques that are relevant to TL, by analyzing the differences as well as the connections between TL and these relevant techniques, which we hope can further clarify the scope of this survey.

Imitation Learning, also known as Apprenticeship Learning, aims to train a policy to mimic the behavior of an expert policy, given that only a few demonstrations from that expert are accessible. It is considered as an alternative to RL to solve sequential decision-making problems when environment feedback is unavailable [30, 31, 32]. There are currently two main paradigms for imitation learning. The first one is Behavior Cloning, in which a policy is trained in a supervised-learning manner, without access to any reinforcement learning signal [33]. The second one is Inverse Reinforcement Learning, in which the goal of imitation learning is to recover a reward function of the domain that can explain the behavior of the expert demonstrator [34]. Imitation Learning is closely related to TL and has been adapted as a TL approach called Learning from Demonstrations (LfD), which will be elaborated in Section 4.2. What distinguishes LfD from the classic Imitation Learning approaches is that LfD still interacts with the domain to access reward signals, in the hope of improving the target policy assisted by a few expert demonstrations, rather than recovering the ground-truth reward function or the expert behavior. LfD can be more effective than IL when the expert demonstrations are actually sub-optimal [35, 36].

Lifelong Learning, or Continual Learning, refers to the ability to learn multiple tasks that are temporally or spatially related, given a sequence of non-stationary information. The key to acquiring Lifelong Learning is a tradeoff between obtaining new information over time and retaining the previously learned knowledge across new tasks. Lifelong Learning is a technique that is applicable to both supervised learning [37] and RL [38, 39], and is also closely related to the topic of Meta Learning [40]. Lifelong Learning can be a more challenging task compared to TL, mainly because it requires an agent to transfer knowledge across a sequence of dynamically changing tasks which cannot be foreseen, rather than performing knowledge transfer among a fixed group of tasks. Moreover, the ability of automatic task detection can also be a requirement for Lifelong Learning [41], whereas for TL the agent is usually notified of the emergence of a new task.

Hierarchical Reinforcement Learning (HRL) has been proposed to resolve real-world tasks that are hierarchical. Different from traditional RL, in an HRL setting the action space is grouped into different granularities to form higher-level macro actions. Accordingly, the learning task is also decomposed into hierarchically dependent subgoals. The most well-known HRL frameworks include Feudal learning [42], the Options framework [43], Hierarchical Abstract Machines [44], and MAXQ [45]. Given the higher-level abstraction on tasks, actions, and state spaces, HRL can facilitate knowledge transfer across similar domains. In this survey, however, we focus on discussing approaches of TL for general RL tasks rather than HRL.

Multi-Agent Reinforcement Learning (MARL) has strong connections with Game Theory [46]. Different from single-agent RL, MARL considers an MDP with multiple agents acting simultaneously in the environment. It aims to solve problems that were difficult or infeasible to be addressed by a single RL agent [47]. The interactive mode for multiple agents can be independent, cooperative, competitive, or even a hybrid setting [48]. Approaches of knowledge transfer for MARL fall into two classes: inter-agent transfer and intra-agent transfer. We refer readers to [49] for a more comprehensive survey under this problem setting. Different from their perspective, this survey emphasizes the general TL approaches for the single-agent scenario, although approaches mentioned in this survey may also be applicable to multi-agent MDPs.

3 ANALYZING TRANSFER LEARNING FROM MULTIPLE PERSPECTIVES

In this section, we provide multi-perspective criteria to analyze TL approaches in the context of RL and introduce metrics for their evaluation.


3.1 Categorization of Transfer Learning Approaches

We point out that TL approaches can be categorized by answering the following key questions:

1) What knowledge is transferred: Knowledge from the source domain can take different forms of supervision, such as a set of expert experiences [50], the action probability distribution of an expert policy [51], or even a potential function that estimates the quality of state-action pairs in the source or target MDP [52]. The divergence in knowledge representations and granularities fundamentally decides the way that TL is performed. The quality of the transferred knowledge, e.g. whether it comes from an oracle policy [53] or is provided by a sub-optimal teacher [36], also affects the way TL methods are designed.

2) What RL frameworks are compatible with the TL approach: We can rephrase this question into other forms, e.g., is the TL approach policy-agnostic, or does it only apply to certain types of RL backbones, such as the Temporal Difference (TD) methods? Answers to this question are closely related to the format of the transferred knowledge. For example, transferring knowledge in the form of expert demonstrations is usually policy-agnostic (see Section 4.2), while policy distillation, as will be discussed in Section 4.3, may not be suitable for RL algorithms such as DQN, which does not explicitly learn a policy function.

3) What is the difference between the source and the target domain: As discussed in Section 2.3, the source domain M_s is the place where the prior knowledge comes from, and the target domain M_t is where the knowledge is transferred to. Some TL approaches are suitable for the scenario where M_s and M_t are equivalent, whereas others are designed to transfer knowledge between different domains. For example, in video gaming tasks where observations are RGB pixels, M_s and M_t may share the same action space (A) but differ in their observation spaces (S). For other problem settings, such as goal-conditioned RL [54], the two domains may differ only by the reward distribution: R_s ≠ R_t. Such domain difference induces difficulties in transfer learning and affects how much knowledge can transfer.

4) What information is available in the target domain: While the cost of accessing knowledge from source domains is usually considered cheaper, it can be prohibitive for the learning agent to access the target domain, or the learning agent can only have a very limited number of environment interactions due to a high sampling cost. Examples for this scenario include learning an auto-driving agent after training it in simulated platforms [55], or training a navigation robot using simulated image inputs before adapting it to real environments [56]. The accessibility of information in the target domain can affect the way that TL approaches are designed.

5) How sample-efficient the TL approach is: This question is related to question 4 regarding the accessibility of the target domain. Compared with training from scratch, TL provides the learning agent with better initial performance, so that it usually needs fewer interactions with the target domain to converge to a good policy, guided by the transferred knowledge. Based on the number of interactions needed to enable TL, we can categorize TL techniques into the following classes: (i) Zero-shot transfer, which learns an agent that is directly applicable to the target domain without requiring any interactions with it; (ii) Few-shot transfer, which only requires a few samples (interactions) from the target domain; and (iii) Sample-efficient transfer, where an agent benefits from TL to learn faster with fewer interactions and is therefore still more sample-efficient compared to RL without any transfer learning.

6) What are the goals of TL: We can answer this question by analyzing two aspects of a TL approach: (i) the evaluation metrics and (ii) the objective function. Evaluation metrics can vary from the asymptotic performance to the number of training iterations used to reach a certain performance threshold, which implies the different emphases of the TL approach. On the other hand, TL approaches may optimize towards various objective functions augmented with different regularizations, which usually hinges on the format of the transferred knowledge. For example, maximizing the policy entropy can be combined with the maximum-return learning objective in order to encourage exploration when the transferred knowledge consists of imperfect demonstrations [57].

3.2 Case Analysis of Transfer Learning

In this section, we use HalfCheetah¹, one of the standard RL benchmarks for solving physical locomotion tasks, as a running example to illustrate how transfer learning can be performed between the source and the target domain. As shown in Figure 1, the objective of HalfCheetah is to train a two-leg agent to run as fast as possible without losing control of itself.

3.2.1 Potential Domain Differences

During TL, the differences between the source and target domain may reside in any component that forms an MDP. The source domain and the target domain can be different in any of the following aspects:

• S (State-space): domains can be made different by extending or constraining the available positions for the agent to move.

• A (Action-space) can be adjusted by changing the range of available torques for the thigh, shin, or foot of the agent.

• R (Reward function): a task can be simplified by using only the distance moved forward as the reward, or made more complex by using the scale of the accelerated velocity in each direction as an extra penalty cost.

• T (Transition dynamics): two domains can differ by following different physical rules, leading to different transition probabilities given the same state-action pairs.

• µ0 (Initial states): the source and target domains may have different initial states, specifying where and with what posture the agent can start moving.

• τ (Trajectories): the source and target domains may allow a different number of steps for the agent to move before a task is done.

1. https://gym.openai.com/envs/HalfCheetah-v2/

Fig. 1: An illustration of the HalfCheetah domain. The learning agent aims to move forward as fast as possible without losing its balance.

3.2.2 Transferrable Knowledge

We list the following transferrable knowledge, assuming that the source and target domains are variants of the HalfCheetah benchmark, although other forms of knowledge transfer may also be feasible:

• Demonstrated trajectories: the target agent can learn from the behavior of a pre-trained expert, e.g. a sequence of running demonstrations.

• Model dynamics: the learning agent may access an approximation model of the physical dynamics, which is learned from the source domain but is also applicable in the target domain. The agent can therefore perform dynamic programming based on the physical rules, running as fast as possible while avoiding loss of control due to the accelerated velocity.

• Teacher policies: an expert policy may be consulted by the learning agent, which outputs the probability of taking different actions upon a given state example.

• Teacher value functions: besides the teacher policy, the learning agent may also refer to the value function derived from a teacher policy, which implies which state-actions are good or bad from the teacher's point of view.

3.3 Evaluation Metrics

We enumerate the following representative metrics for evaluating TL approaches, some of which have also been summarized in prior work [58], [13]:

• Jumpstart Performance (jp): the initial performance (returns) of the agent.
• Asymptotic Performance (ap): the ultimate performance (returns) of the agent.
• Accumulated Rewards (ar): the area under the learning curve of the agent.
• Transfer Ratio (tr): the ratio between the ap of the agent with TL and the ap of the agent without TL.
• Time to Threshold (tt): the learning time (iterations) needed for the target agent to reach a certain performance threshold.
• Performance with Fixed Training Epochs (pe): the performance achieved by the target agent after a specific number of training iterations.
• Performance Sensitivity (ps): the variance in returns using different hyper-parameter settings.

The above criteria mainly focus on the learning process of the target agent. In addition, we introduce the following metrics from the perspective of the transferred knowledge, which, although commensurately important for evaluation, have not been explicitly discussed by prior art:

• Necessary Knowledge Amount (nka): the necessary amount of knowledge required for TL in order to achieve certain performance thresholds. Examples along this line include the number of designed source tasks, the number of expert policies, or the number of demonstrated interactions required to enable knowledge transfer.

• Necessary Knowledge Quality (nkq): the necessary quality of the knowledge required to enable effective TL. This metric helps in answering questions such as (i) does the TL approach rely on near-oracle knowledge from the source domain, such as expert-level demonstrations/policies, or (ii) is the TL technique feasible even given sub-optimal knowledge?

Metrics from the perspective of transferred knowledge are harder to standardize because TL approaches differ in various perspectives, including the forms of transferred knowledge, the RL frameworks utilized to enable such transfer, and the difference between the source and the target domains. Comparing TL approaches from just one viewpoint may therefore lead to biased evaluations. However, we believe that explicating these knowledge-related metrics will help in designing more generalizable and efficient TL approaches.

In general, most of the abovementioned metrics can be considered as evaluating two abilities of a TL approach: Mastery and Generalization. Mastery refers to how well the learned agent can ultimately perform in the target domain, while Generalization refers to the ability of the learning agent to quickly adapt to the target domain assisted by the transferred knowledge. Metrics such as ap, ar, and tr evaluate the ability of Mastery, whereas jp, ps, nka, and nkq emphasize the ability of Generalization. Metrics such as tt can measure either the Mastery or the Generalization ability, depending on the choice of threshold: tt with a threshold approaching the optimum emphasizes Mastery, while a lower threshold focuses more on Generalization. Equivalently, pe can focus on either side depending on the choice of the number of training epochs.


Fig. 2: An overview of different TL approaches, organized by the format of transferred knowledge.

4 TRANSFER LEARNING APPROACHES

In this section, we elaborate on various TL approaches and organize them into different sub-topics, mostly by answering the question of "what knowledge is transferred". For each type of TL approach, we investigate it by following the other criteria mentioned in Section 3. We start with the Reward Shaping approach (Section 4.1), which is generally applicable to different RL algorithms while requiring minimal changes to the underlying RL framework, and which overlaps with the other TL approaches discussed in this chapter. We also provide an overview of the different TL approaches discussed in this survey in Figure 2.

4.1 Reward Shaping

Reward Shaping (RS) is a technique that leverages exterior knowledge to reconstruct the reward distribution of the target domain in order to guide the agent's policy learning. More specifically, in addition to the environment reward signals, RS learns a reward-shaping function F : S × A × S → R to render auxiliary rewards, provided that the additional rewards contain external knowledge to guide the agent toward better action selections. Intuitively, an RS strategy will assign higher rewards to more beneficial state-actions, which can navigate the agent to desired trajectories. As a result, the agent will learn its policy using the newly shaped rewards R′ = R + F, which means that RS has altered the target domain with a different reward function:

M = (S, A, T, γ, R) → M′ = (S, A, T, γ, R′).

Along the line of RS, Potential-based Reward Shaping (PBRS) is one of the most classical approaches. [52] proposed PBRS to form the shaping function F as the difference between two potential functions Φ(·):

F(s, a, s′) = γΦ(s′) − Φ(s),

where the potential function Φ(·) comes from the knowledge of expertise and evaluates the quality of a given state. The structure of the potential difference addresses a cycle dilemma mentioned in [59], in which an agent can get positive rewards by following a sequence of states which forms a cycle {s1, s2, s3, . . . , sn, s1} with F(s1, a1, s2) + F(s2, a2, s3) + · · · + F(sn, an, s1) > 0. Potential-based reward shaping avoids this issue by making any state cycle meaningless, with Σ_{i=1}^{n−1} F(si, ai, si+1) ≤ −F(sn, an, s1) ≤ 0.

It has been proved that, without further restrictions on the underlying MDP or the shaping function F, PBRS is sufficient and necessary to preserve policy invariance. Moreover, the optimal Q-functions in the original and transformed MDPs are related by the potential function:

Q*_{M′}(s, a) = Q*_M(s, a) − Φ(s),     (1)

which draws a connection between potential-based reward shaping and advantage-based learning approaches [60].
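As a concrete illustration of PBRS (a generic sketch, not code from the survey; a Gym-like environment interface is assumed), the shaping reward γΦ(s′) − Φ(s) can be added to the environment reward without otherwise changing the RL algorithm:

class PBRSWrapper:
    """Minimal potential-based reward shaping wrapper (Gym-like reset/step API assumed)."""

    def __init__(self, env, potential_fn, gamma=0.99):
        self.env = env
        self.potential_fn = potential_fn  # Phi(s): expert knowledge about state quality
        self.gamma = gamma

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, action):
        next_state, reward, done, info = self.env.step(action)
        # F(s, a, s') = gamma * Phi(s') - Phi(s); adding it preserves policy invariance.
        shaping = self.gamma * self.potential_fn(next_state) - self.potential_fn(self.state)
        self.state = next_state
        return next_state, reward + shaping, done, info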

The idea of PBRS was extended by [61], which formulated the potential as a function over both the state and the action space. This approach is called Potential-Based state-action Advice (PBA). The potential function Φ(s, a) therefore evaluates how beneficial an action a is to take from state s:

F(s, a, s′, a′) = γΦ(s′, a′) − Φ(s, a).     (2)

One limitation of PBA is that it requires on-policy learning, which can be sample-inefficient, since in Equation (2), a′ is the action to be taken after state s transitions to s′ by following the learning policy. Similar to Equation (1), the optimal Q-functions in both MDPs are connected by the difference of potentials: Q*_{M′}(s, a) = Q*_M(s, a) − Φ(s, a). Once the optimal policy in M′ is learned, the optimal policy in M can be recovered:

π*_M(s) = argmax_{a∈A} (Q*_M(s, a) − Φ(s, a)).

Traditional RS approaches assumed a static potential function, until [62] proposed a Dynamic Potential Based (DPB) approach which makes the potential a function of both states and time: F(s, t, s′, t′) = γΦ(s′, t′) − Φ(s, t). They proved that this dynamic approach can still maintain policy invariance: Q*_{M′}(s, a) = Q*_M(s, a) − Φ(s, t), where t is the current timestep. [63] later introduced a way to incorporate any prior knowledge into a dynamic potential function structure, which is called Dynamic Value-Function Advice (DPBA). The underlying rationale of DPBA is that, given any extra reward function R+ from prior knowledge, in order to add this extra reward to the original reward function, the potential function should satisfy:

γΦ(s′, a′) − Φ(s, a) = F(s, a) = R+(s, a).

If Φ is not static but learned as an extra state-action value function over time, then the Bellman equation for Φ is:

Φ^π(s, a) = r_Φ(s, a) + γΦ(s′, a′).

The shaping reward F(s, a) is therefore the negation of r_Φ(s, a): F(s, a) = γΦ(s′, a′) − Φ(s, a) = −r_Φ(s, a). This leads to the approach of using the negation of R+ as the immediate reward to train an extra state-action value function Φ and the policy simultaneously, with r_Φ(s, a) = −R+(s, a). Φ will be updated by a residual term δ(Φ):

Φ(s, a) ← Φ(s, a) + βδ(Φ),

where δ(Φ) = −R+(s, a) + γΦ(s′, a′) − Φ(s, a), and β is the learning rate. Accordingly, the dynamic shaping function F becomes:

F_t(s, a) = γΦ_{t+1}(s′, a′) − Φ_t(s, a).

The advantage of DPBA is that it provides a framework that allows arbitrary knowledge to be shaped as auxiliary rewards.
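To illustrate the DPBA mechanism described above, here is a minimal tabular sketch (hypothetical variable names, SARSA-style updates, and an approximate treatment of the Φ_t/Φ_{t+1} update ordering are assumed) that trains the auxiliary potential Φ with immediate reward −R+ and uses the resulting dynamic potential difference as the shaping reward:

import numpy as np

def dpba_step(Phi, Q, s, a, r_env, s_next, a_next, R_plus,
              gamma=0.99, beta=0.1, alpha=0.1):
    """One DPBA-style update: Phi is learned with reward r_Phi = -R_plus(s, a),
    and the shaping reward is F = gamma * Phi(s', a') - Phi(s, a)."""
    # Update the auxiliary potential function Phi with the negated prior-knowledge reward.
    delta_phi = -R_plus[s, a] + gamma * Phi[s_next, a_next] - Phi[s, a]
    Phi[s, a] += beta * delta_phi
    # Dynamic shaping reward formed from the (changing) potential function.
    F = gamma * Phi[s_next, a_next] - Phi[s, a]
    # Ordinary SARSA-style update of the agent's Q-function on the shaped reward.
    td = (r_env + F) + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td
    return Phi, Q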

Efforts along this line mainly focus on designing different shaping functions F(s, a), while little work has addressed the question of what knowledge can be used to derive this potential function. One work by [64] proposed to use RS to transfer an expert policy from the source domain (M_s) to the target domain (M_t). This approach assumed the existence of two mapping functions, M_S and M_A, which can transform the state and action from the source to the target domain. The augmented reward is then simply π_s(M_S(s), M_A(a)), which is the probability that the mapped state and action will be taken by the expert policy in the source domain. Another work used demonstrated state-action samples from an expert policy to shape rewards [65]. Learning the augmented reward involves a discriminator, which is trained to distinguish samples generated by an expert policy from samples generated by the target policy. The loss of the discriminator is applied to shape rewards that incentivize the learning agent to mimic the expert behavior. This work is a combination of two TL approaches: RS and Learning from Demonstrations, the latter of which will be elaborated in Section 4.2.

Besides the single-agent and model-free RL scheme, there have been efforts to apply RS to multi-agent RL [66], model-based RL [67], and hierarchical RL [68]. In particular, [66] extended the idea of RS to multi-agent systems, showing that the Nash Equilibria of the underlying stochastic game are unchanged under a potential-based reward shaping structure. [67] applied RS to model-based RL, where the potential function is learned based on the free-space assumption, an approach to model transition dynamics in the environment. [68] integrated RS into MAXQ, which is a hierarchical RL algorithm framework, by augmenting the extra reward onto the completion function of MAXQ [45].

The RS approaches discussed so far are built upon a consensus that the source information for shaping the reward comes externally, which coincides with the notion of knowledge transfer. Some work on RS also considers the scenario where the augmented reward comes intrinsically, such as the Belief Reward Shaping proposed by [69], which utilizes a Bayesian reward-shaping framework to generate a potential value that decays with experience, where the potential value comes from the critic itself.

The above RS approaches are summarized in Table 1. In general, most RS approaches follow the potential-based RS principle that has been developed systematically: from the classical PBRS, which is built on a static potential shaping function of states; to PBA, which generates the potential as a function of both states and actions; to DPB, which learns a dynamic potential function of states and time; and to the state-of-the-art DPBA, which involves a dynamic potential function of states and actions that is learned as an extra state-action value function in parallel with the environment value function. As an effective TL paradigm, RS has been widely applied to fields including robot training [70], spoken dialogue systems [71], and question answering [72]. It provides a feasible framework for transferring knowledge as the augmented reward and is generally applicable to various RL algorithms. How to integrate RS with other TL approaches, such as Learning from Demonstrations (Section 4.2) and Policy Transfer (Section 4.3), to build the potential function for shaping will be an intriguing question for ongoing research.

4.2 Learning from Demonstrations

In this section, we review TL techniques in which the transferred knowledge takes the form of external demonstrations. The demonstrations may come from different sources with different qualities: they can be provided by a human expert, a previously learned expert policy, or even a suboptimal policy. For the following discussion, we use D_E to denote a set of demonstrations, and each element in D_E is a transition tuple, i.e. (s, a, s′, r) ∈ D_E. Efforts along this line mostly address a specific TL scenario, i.e. the source and the target MDPs are the same: M_s = M_t, although there has been work that learns from demonstrations generated in a different domain [73].

In general, learning from demonstrations (LfD) is a technique to assist RL by utilizing provided demonstrations for more efficient exploration. Knowledge conveyed in demonstrations encourages the agent to explore states that can benefit its policy learning. Depending on when the demonstrations are used for knowledge transfer, approaches can be organized into offline methods and online methods. For offline approaches, demonstrations are used for pre-training RL components before the RL learning step. RL components such as the value function V(s) [74], the policy π [2], or even the model of transition dynamics [75] are initialized by supervised learning using these demonstrations. For the online approaches, demonstrations are directly used in the RL stage to guide agent actions for efficient exploration [76]. Most work discussed in this section follows the online transfer paradigm or combines offline pre-training with online RL learning [77]. Depending on what RL frameworks are used to enable knowledge transfer, work in this domain can be categorized into different branches: some adopt the policy-iteration framework [50, 78, 79], others follow a Q-learning framework [76, 80], while more recent work follows the policy-gradient framework [36, 65, 77, 81].


Methods | MDP Difference | Format of shaping reward | Knowledge source
PBRS | Ms = Mt | F = γΦ(s′) − Φ(s) | ✗
PBA | Ms = Mt | F = γΦ(s′, a′) − Φ(s, a) | ✗
DPB | Ms = Mt | F = γΦ(s′, t′) − Φ(s, t) | ✗
DPBA | Ms = Mt | Ft = γΦt+1(s′, a′) − Φt(s, a), Φ learned as an extra Q-function | ✗
[64] | Ss ≠ St, As ≠ At | Ft = γΦt+1(s′, a′) − Φt(s, a) | πs
[65] | Ms = Mt | Ft = γΦt+1(s′, a′) − Φt(s, a) | DE

TABLE 1: A comparison of reward shaping approaches. ✗ denotes that the information is not revealed in the paper.

Demonstration data have been applied in the policy iteration framework by [82]. Later, [78] introduced the Direct Policy Iteration with Demonstrations (DPID) algorithm. This approach samples complete demonstrated rollouts D_E from an expert policy π_E, in combination with the self-generated rollouts D_π gathered from the learning agent. D_π ∪ D_E are used to learn a Monte-Carlo estimation of the Q-value Q, from which a learning policy can be derived greedily: π(s) = argmax_{a∈A} Q(s, a). This policy π is further regularized by a loss function L(π, π_E) to minimize its discrepancy from the expert policy decisions:

L(π, π_E) = (1 / N_E) Σ_{i=1}^{N_E} 1{π_E(s_i) ≠ π(s_i)},

where N_E is the number of expert demonstration samples, and 1(·) is an indicator function.

Another work along this line includes the Approximate Policy Iteration with Demonstration (APID) algorithm, which was proposed by [50] and extended by [79]. Different from DPID, where both D_E and D_π were used for value estimation, the APID algorithm applies only D_π to approximate the Q-function. The expert demonstrations D_E are used to shape the value function so that, given any state s_i, the expert action π_E(s_i) carries a higher Q-value margin compared with other actions that are not shown in D_E:

Q(s_i, π_E(s_i)) − max_{a∈A\π_E(s_i)} Q(s_i, a) ≥ 1 − ξ_i.

The term ξ_i is used to account for the case of imperfect demonstrations. This value-shaping idea is instantiated as an augmented hinge loss to be minimized during the policy evaluation step:

Q ← argmin_Q f(Q), where
f(Q) = L_π(Q) + (α / N_E) Σ_{i=1}^{N_E} [ 1 − (Q(s_i, π_E(s_i)) − max_{a∈A\π_E(s_i)} Q(s_i, a)) ]_+,

in which [z]_+ = max{0, z} is the hinge loss, and L_π(Q) is the Q-function loss induced by an empirical norm of the Bellman residual:

L_π = E_{(s,a)∼D_π} ‖T^π Q(s, a) − Q(s, a)‖,

where T^π Q(s, a) = R(s, a) + γ E_{s′∼p(·|s,a)}[Q(s′, π(s′))] is the Bellman contraction operator. [79] further extended the work of APID with a different evaluation loss:

L_π = E_{(s,a)∼D_π} ‖T* Q(s, a) − Q(s, a)‖,

where T* Q(s, a) = R(s, a) + γ E_{s′∼p(·|s,a)}[max_{a′} Q(s′, a′)]. Their work theoretically converges to the optimal Q-function compared with APID, as their L_π minimizes the optimal Bellman residual instead of the empirical norm of the Bellman residual under π.
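The large-margin term above can be written compactly in code; the following is a rough sketch (NumPy, batched Q-values, hypothetical array names), not an implementation from the cited papers:

import numpy as np

def demonstration_margin_loss(Q_values, expert_actions, margin=1.0):
    """Hinge loss [margin - (Q(s, a_E) - max_{a != a_E} Q(s, a))]_+ averaged over demonstrations.

    Q_values: array of shape (N_E, |A|) with the current Q estimates at demonstrated states.
    expert_actions: array of shape (N_E,) with the expert's action index for each state.
    """
    n = Q_values.shape[0]
    q_expert = Q_values[np.arange(n), expert_actions]
    # Mask out the expert action before taking the max over the remaining actions.
    masked = Q_values.copy()
    masked[np.arange(n), expert_actions] = -np.inf
    q_best_other = masked.max(axis=1)
    return np.maximum(0.0, margin - (q_expert - q_best_other)).mean()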

In addition to policy iteration, the following two approaches integrate demonstration data into the TD-learning framework, such as Q-learning. Specifically, [76] proposed the Deep Q-learning from Demonstration (DQfD) algorithm, which maintains two separate replay buffers to store demonstrated data and self-generated data, respectively, so that expert demonstrations can always be sampled with a certain probability. Their work leverages the refined prioritized replay mechanism [83], where the probability of sampling a transition i is based on its priority p_i with a temperature parameter α:

P(i) = p_i^α / Σ_k p_k^α.
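A minimal sketch of this prioritized sampling rule (hypothetical priority values, not the full DQfD pipeline):

import numpy as np

def sample_transition(priorities, alpha=0.6, rng=None):
    """Sample a transition index with probability P(i) = p_i^alpha / sum_k p_k^alpha."""
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=float) ** alpha
    return rng.choice(len(priorities), p=p / p.sum())

# Demonstrated transitions can be kept permanently and given a priority bonus so that
# they keep being replayed alongside self-generated data.
priorities = [0.5, 2.0, 1.0, 3.0]   # hypothetical TD-error-based priorities
idx = sample_transition(priorities)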

Another work under the Q-learning framework was proposed by [80]. Their work, dubbed LfDS, draws a close connection to the reward shaping technique in Section 4.1. It builds the potential function based on a set of expert demonstrations, and the potential value of a given state-action pair is measured by its highest similarity to the expert experiences. This augmented reward assigns more credit to state-actions that are more similar to expert demonstrations, which can eventually encourage the agent toward expert-like behavior.

Besides Q-learning, recent work has integrated LfD into the policy gradient framework [30, 36, 65, 77, 81]. A representative work along this line is Generative Adversarial Imitation Learning (GAIL), proposed by [30]. GAIL introduced the notion of occupancy measure d_π, which is the stationary state-action distribution derived from a policy π. Based on this notion, a new reward function is designed such that maximizing the accumulated new rewards encourages minimizing the distribution divergence between the occupancy measure of the current policy π and that of the expert policy π_E. Specifically, the new reward is learned by adversarial training [53]: a discriminator D is trained to distinguish interactions sampled from the current policy π from those of the expert policy π_E:

J_D = max_{D: S×A→(0,1)} E_{d_π} log[1 − D(s, a)] + E_{d_E} log[D(s, a)].

Since π_E is unknown, its state-action distribution d_E is estimated based on the given expert demonstrations D_E. It has been proved that, for an optimized discriminator, its output satisfies D(s, a) = d_π / (d_π + d_E). The output of the discriminator is used as a new reward to encourage distribution matching, with r′(s, a) = −log(1 − D(s, a)). The RL process is naturally altered to perform distribution matching by optimizing the following minimax objective:

max_π min_D J(π, D) := E_{d_π} log[1 − D(s, a)] + E_{d_E} log[D(s, a)].
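To make the adversarial reward concrete, here is a schematic PyTorch-style sketch (the discriminator module, optimizer, and batch tensors are assumptions, not the authors' code) of how the discriminator objective above is optimized and how its output is turned into the reward r′(s, a) = −log(1 − D(s, a)):

import torch

def discriminator_update(D, opt, policy_sa, expert_sa):
    """One update of D to maximize E_pi[log(1 - D)] + E_expert[log D].
    D maps (state, action) batches to probabilities in (0, 1)."""
    d_pi = D(policy_sa)
    d_exp = D(expert_sa)
    loss = -(torch.log(1.0 - d_pi).mean() + torch.log(d_exp).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()

def adversarial_reward(D, state_action):
    """Reward r'(s, a) = -log(1 - D(s, a)) used in place of (or added to) env rewards."""
    with torch.no_grad():
        return -torch.log(1.0 - D(state_action))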


Although GAIL is more related to Imitation Learning than LfD, its philosophy of using expert demonstrations for distribution matching has inspired other LfD algorithms. For example, [81] extended GAIL with an algorithm called POfD, which combines the discriminator reward with the environment reward, so that the agent is trained to maximize the accumulated environment rewards (RL objective) as well as to perform distribution matching (imitation learning objective):

max_θ E_{d_π}[r(s, a)] − λ D_JS[d_π || d_E].     (3)

They further proved that optimizing Equation (3) is the same as a dynamic reward-shaping mechanism (Section 4.1):

max_θ E_{d_π}[r′(s, a)],

where r′(s, a) = r(s, a) − λ log(D_w(s, a)) is the shaped reward.

Both GAIL and POfD are under an on-policy RL framework. To further improve the sample efficiency of TL, some off-policy algorithms have been proposed, such as DDPGfD [65], which is built upon the DDPG framework. DDPGfD shares a similar idea as DQfD in that they both use a second replay buffer for storing demonstrated data, and each demonstrated sample holds a sampling priority p_i. For a demonstrated sample, its priority p_i is augmented with a constant bias ε_D > 0 in order to encourage more frequent sampling of expert demonstrations:

p_i = δ_i^2 + λ‖∇_a Q(s_i, a_i|θ^Q)‖^2 + ε + ε_D,

where δ_i is the TD residual for transition i, ‖∇_a Q(s_i, a_i|θ^Q)‖^2 is the loss applied to the actor, and ε is a small positive constant to ensure that all transitions are sampled with some probability.

Another work also adopted the DDPG framework to learn from demonstrations [77]. Their approach differs from DDPGfD in that its objective function is augmented with a Behavior Cloning Loss to encourage imitating the provided demonstrations:

L_BC = Σ_{i=1}^{|D_E|} ‖π(s_i|θ^π) − a_i‖^2.

To further address the issue of suboptimal demonstrations, in [77] the form of the Behavior Cloning Loss is altered based on the critic output, so that only demonstrated actions with higher Q-values contribute to the loss penalty:

L_BC = Σ_{i=1}^{|D_E|} ‖π(s_i|θ^π) − a_i‖^2 · 1[Q(s_i, a_i) > Q(s_i, π(s_i))].
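A rough PyTorch-style sketch of this Q-filtered behavior cloning loss (the actor/critic modules and tensor shapes are assumptions, not the authors' code):

import torch

def q_filtered_bc_loss(actor, critic, demo_states, demo_actions):
    """L_BC = sum_i ||pi(s_i) - a_i||^2 * 1[Q(s_i, a_i) > Q(s_i, pi(s_i))]."""
    pi_actions = actor(demo_states)                 # pi(s_i | theta_pi)
    q_demo = critic(demo_states, demo_actions)      # Q(s_i, a_i)
    q_pi = critic(demo_states, pi_actions)          # Q(s_i, pi(s_i))
    # Only imitate demonstrated actions that the critic rates higher than the policy's own.
    mask = (q_demo > q_pi).float().detach().squeeze(-1)
    per_sample = ((pi_actions - demo_actions) ** 2).sum(dim=-1)
    return (per_sample * mask).sum()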

There are several challenges faced by LfD, one of which is imperfect demonstrations. Previous approaches usually presume near-oracle demonstrations. However, demonstrations can also be biased estimations of the environment or even come from a sub-optimal policy [36]. Current solutions to imperfect demonstrations include altering the objective function. For example, [50] leveraged the hinge-loss function to allow occasional violations of the property that Q(s_i, π_E(s_i)) − max_{a∈A\π_E(s_i)} Q(s_i, a) ≥ 1. Some other work uses regularization on the objective to alleviate overfitting on biased data [76, 83]. A different strategy to confront the sub-optimality is to leverage those sub-optimal demonstrations only to boost the initial learning stage. Specifically, in the same spirit as GAIL, [36] proposed Self-Adaptive Imitation Learning (SAIL), which learns from sub-optimal demonstrations using generative adversarial training while gradually selecting high-quality self-generated trajectories to replace the less superior demonstrations.

Another challenge faced by LfD is overfitting: demonstrations may be provided in limited numbers, which results in the learning agent lacking guidance on states that are unseen in the demonstration dataset. This challenge is aggravated in MDPs with sparse reward feedback, as the learning agent cannot obtain much supervision information from the environment either. It is also closely related to the covariate drift issue [84], which is commonly confronted by approaches of behavior cloning. Current efforts to address this challenge include encouraging exploration by using an entropy-regularized objective [57], decaying the effects of demonstration guidance by softening its regularization on policy learning over time [35], and introducing disagreement regularization by training an ensemble of policies based on the given demonstrations, where the variance among policies serves as a cost (negative reward) function [85].

We summarize the above-discussed approaches in Table 2. In general, demonstration data can help both in offline pre-training for better initialization and in online RL for efficient exploration. During the RL learning phase, demonstration data can be used together with self-generated data to encourage expert-like behaviors (DDPGfD, DQfD), to shape value functions (APID), or to guide the policy update in the form of an auxiliary objective function (DPID, GAIL, POfD). The RL frameworks currently used for LfD include policy iteration, Q-learning, and policy gradient. Developing more general LfD approaches that are agnostic to the RL framework and can learn from sub-optimal or limited demonstrations would be the next focus for this research domain.

4.3 Policy Transfer

In this section, we review TL approaches of Policy Transfer, where the external knowledge takes the form of pretrained policies from one or multiple source domains. Work discussed in this section is built upon a many-to-one problem setting, which we formulate as below:

Problem Setting. (Policy Transfer) A set of teacher policies π_{E1}, π_{E2}, . . . , π_{EK} are trained on a set of source domains M_1, M_2, . . . , M_K, respectively. A student policy π is learned for a target domain by leveraging knowledge from {π_{Ei}}_{i=1}^K.

For the one-to-one scenario, which contains only one teacher policy, one can consider it as a special case of the above problem setting with K = 1. Next, we categorize recent work on policy transfer into two techniques: policy distillation and policy reuse.

4.3.1 Transfer Learning via Policy Distillation

The term knowledge distillation was proposed by [86] as an approach for ensembling knowledge from multiple teacher models into a single student model. This technique was later extended from the field of supervised learning to RL.


Methods | Optimality Guarantee | Format of transferred demonstrations | RL framework
DQfD | ✗ | Cached transitions in the replay buffer | DQN
LfDS | ✗ | Reward shaping function | DQN
GAIL | ✓ | Reward shaping function: −λ log(1 − D(s, a)) | TRPO
POfD | ✓ | Reward shaping function: r(s, a) − λ log(1 − D(s, a)) | TRPO, PPO
DDPGfD | ✓ | Increasing sampling priority | DDPG
[77] | ✓ | Increasing sampling priority and behavior cloning loss | DDPG
DPID | ✓ | Indicator binary loss: L(s_i) = 1{π_E(s_i) ≠ π(s_i)} | API
APID | ✗ | Hinge loss on the marginal loss: [L(Q, π, π_E)]_+ | API
APID extension | ✓ | Marginal loss: L(Q, π, π_E) | API
SAIL | ✗ | Reward shaping function: r(s, a) − λ log(1 − D(s, a)) | DDPG

TABLE 2: A comparison of learning from demonstration approaches.

Since the student model is usually shallower than the teacher model and can perform across multiple teacher tasks, policy distillation is also considered as an effective approach for model compression [87] and multi-task RL [88].

The idea of knowledge distillation has been applied tothe field of RL to enable policy distillation. Conventionalpolicy distillation approaches transfer the teacher policy in asupervised learning paradigm [88, 89]. Specifically, a studentpolicy is learned by minimizing the divergence of actiondistributions between the teacher policy πE and studentpolicy πθ , which is denoted as H×(πE(τt)|πθ(τt)):

min_θ E_{τ∼π_E} [ Σ_{t=1}^{|τ|} ∇_θ H^×(π_E(τ_t) | π_θ(τ_t)) ].

The above expectation is taken over trajectories sampled from the teacher policy π_E, which therefore makes this approach teacher distillation. A representative example of work along this line is [88], in which N teacher policies are learned for N source tasks separately, and each teacher yields a dataset D^E = {s_i, q_i}_{i=0}^{N} consisting of observations (states) s and vectors of the corresponding q-values q, such that q_i = [Q(s_i, a_1), Q(s_i, a_2), ... | a_j ∈ A]. Teacher policies are further distilled into a single student agent π_θ by minimizing the KL-divergence between each teacher policy π_{E_i}(a|s) and the student policy π_θ, approximated using the dataset D^E:

min_θ D_KL(π_E | π_θ) ≈ Σ_{i=1}^{|D^E|} softmax(q_i^E / τ) ln( softmax(q_i^E) / softmax(q_i^θ) ).
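As a concrete illustration of the teacher-distillation objective, the sketch below computes a temperature-softened KL loss between the teacher's Q-values and a student's action logits, as one would do on a batch drawn from the dataset D^E. It is a simplified PyTorch sketch: the student network, batch variables, and the temperature value are placeholders, not details prescribed by [88].

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_q, tau=0.01):
    """KL(softmax(q_teacher / tau) || softmax(student_logits)) averaged over a batch.

    A small temperature tau sharpens the teacher's action distribution."""
    teacher_probs = F.softmax(teacher_q / tau, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    kl = (teacher_probs * (teacher_probs.log() - student_log_probs)).sum(dim=-1)
    return kl.mean()

# Usage sketch (all names illustrative):
# student = torch.nn.Linear(state_dim, n_actions)
# loss = distillation_loss(student(s_batch), q_batch)   # (s_batch, q_batch) ~ D^E
# loss.backward()
```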

An alternative policy distillation approach is called student distillation [51, 90], which is similar to teacher distillation, except that during the optimization step, the expectation is taken over trajectories sampled from the student policy instead of the teacher policy:

min_θ E_{τ∼π_θ} [ Σ_{t=1}^{|τ|} ∇_θ H^×(π_E(τ_t) | π_θ(τ_t)) ].

[51] provides a nice summarization of the related work on both kinds of distillation approaches. While it is feasible to combine both [84], we observe that more recent work focuses on student distillation, which empirically shows better exploration ability compared to teacher distillation, especially when the teacher policy is deterministic.

From a different perspective, there are two approaches to distilling the knowledge from teacher policies to a student: (1) minimizing the cross-entropy loss between the teacher and student policy distributions over actions [90, 91]; and (2) maximizing the probability that the teacher policy will visit trajectories generated by the student, i.e. max_θ P(τ ∼ π_E | τ ∼ π_θ) [92, 93]. One example of approach (1) is the Actor-Mimic algorithm [90], which distills the knowledge of expert agents into the student by minimizing the cross-entropy between the student policy π_θ and each teacher policy π_{E_i} over actions:

L_i(θ) = −Σ_{a∈A_{E_i}} π_{E_i}(a|s) log π_θ(a|s),

where each teacher agent is learned based on DQN, whose policy is therefore derived from a Boltzmann distribution over the Q-function outputs:

π_{E_i}(a|s) = exp(τ^{-1} Q_{E_i}(s, a)) / Σ_{a'∈A_{E_i}} exp(τ^{-1} Q_{E_i}(s, a')).
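A minimal sketch of the Actor-Mimic-style objective for a single state follows, assuming the teacher's Q-values are given; the Boltzmann teacher policy and the cross-entropy loss mirror the two equations above, and the small constant inside the logarithm is added only for numerical stability.

```python
import numpy as np

def boltzmann_policy(teacher_q, tau=1.0):
    """Teacher policy derived from DQN Q-values via a Boltzmann distribution."""
    z = np.exp(teacher_q / tau - np.max(teacher_q / tau))  # stabilized softmax
    return z / z.sum()

def actor_mimic_loss(student_probs, teacher_q, tau=1.0):
    """Cross-entropy between the Boltzmann teacher policy and the student
    policy over actions, evaluated at a single state."""
    teacher_probs = boltzmann_policy(teacher_q, tau)
    return -np.sum(teacher_probs * np.log(student_probs + 1e-8))
```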

An instantiation of approach (2) is the Distral algorithm [92], in which a centroid policy π_θ is trained based on K teacher policies, with each teacher policy learned in a source domain M_i = (S_i, A_i, T_i, γ, R_i), in the hope that the knowledge in each teacher π_{E_i} can be distilled into the centroid and then transferred to student policies. It is assumed that both the transition dynamics T_i and the reward distributions R_i differ across the source MDPs. A distilled (student) policy is learned to perform in different domains by maximizing Σ_{i=1}^{K} J(π_θ, π_{E_i}), where

J(π_θ, π_{E_i}) = E_{(s_t, a_t)∼π_θ} [ Σ_{t≥0} γ^t ( r_i(a_t, s_t) + (α/β) log π_θ(a_t|s_t) − (1/β) log π_{E_i}(a_t|s_t) ) ],

in which both the log π_θ(a_t|s_t) term and the −log π_{E_i}(a_t|s_t) term are used as augmented rewards. Therefore, the above approach also draws a close connection to Reward Shaping (Section 4.1). In effect, the log π_θ(a_t|s_t) term guides the learning policy π_θ to yield actions that are more likely to be generated by the teacher policy, whereas the entropy term −log π_{E_i}(a_t|s_t) serves as a bonus reward for exploration. A similar approach was proposed by [91], which only uses the cross-entropy between teacher and student policies, λ H(π_E(a_t|s_t) || π_θ(a_t|s_t)), to reshape rewards.


Moreover, they adopted a dynamically fading coefficient to alleviate the effect of the augmented reward, so that the student policy becomes independent of the teachers after a certain number of optimization iterations.

4.3.2 Transfer Learning via Policy Reuse

In addition to policy distillation, another policy transfer approach is Policy Reuse, which directly reuses policies from source tasks to build the target policy.

The notion of policy reuse was proposed by [94], in which expert policies are directly reused according to a probability distribution P, where the probability of each policy being used during training is related to the expected performance gain of that policy in the target domain, denoted as W_i:

P(π_{E_i}) = exp(t W_i) / Σ_{j=0}^{K} exp(t W_j),

where t is a dynamic temperature parameter that increases over time. Under a Q-learning framework, the Q-function of the target policy is learned in an iterative scheme: during every learning episode, W_i is evaluated for each expert policy π_{E_i}, and W_0 is obtained for the learning policy, from which a reuse probability P is derived. Next, a behavior policy is sampled from this probability P. If an expert is sampled as the behavior policy, the Q-function of the learning policy is updated by following the behavior policy in an ε-greedy fashion. Otherwise, if the learning policy itself is selected as the behavior policy, a fully greedy Q-learning update is performed. After each training episode, both W_i and the temperature t for calculating the reuse probability are updated accordingly. One limitation of this approach is that W_i, i.e. the expected return of each expert policy on the target task, needs to be evaluated frequently. This work was implemented in a tabular case, leaving the scalability issue unresolved.
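The reuse distribution and the episode-level loop described above can be sketched as follows; the update rules for W and the temperature increment are illustrative choices, not the exact schedule of [94].

```python
import numpy as np

def reuse_probabilities(W, t):
    """Softmax over expected gains W (index 0 is the learning policy itself),
    with temperature t increasing over time."""
    z = np.exp(t * (W - np.max(W)))  # stabilized softmax
    return z / z.sum()

# Episode-loop sketch (run_episode is a placeholder that follows the chosen
# behavior policy -- eps-greedy for an expert, greedy for the learning policy --
# performs the Q-learning updates, and returns the episode return):
# W, t = np.zeros(K + 1), 0.0
# for episode in range(n_episodes):
#     p = reuse_probabilities(W, t)
#     idx = np.random.choice(len(W), p=p)        # sample the behavior policy
#     ret = run_episode(idx)
#     W[idx] += (ret - W[idx]) / (episode + 1)   # running average of returns
#     t += delta_t                               # raise the temperature
```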

More recent work by [95] extended the Policy Improvement theorem [96] from one to multiple policies, which is named Generalized Policy Improvement. We restate its main theorem as follows:

Theorem. [Generalized Policy Improvement (GPI)] Let π_1, π_2, ..., π_n be n decision policies and let Q̃^{π_1}, Q̃^{π_2}, ..., Q̃^{π_n} be approximations of their action-value functions, such that |Q^{π_i}(s, a) − Q̃^{π_i}(s, a)| ≤ ε for all s ∈ S, a ∈ A, and i ∈ {1, 2, ..., n}. Define π(s) = argmax_a max_i Q̃^{π_i}(s, a). Then:

Q^π(s, a) ≥ max_i Q̃^{π_i}(s, a) − (2 / (1 − γ)) ε

for any s ∈ S and a ∈ A, where Q^π is the action-value function of π.

Based on this theorem, a policy improvement approach can be naturally derived by greedily choosing, for a given state, the action that renders the highest Q-value among all policies. Another work along this line is [95], in which each expert policy π_{E_i} is trained on a different source domain M_i with reward function R_i, so that Q^π_{M_0}(s, a) ≠ Q^π_{M_i}(s, a). To efficiently evaluate the Q-functions of different source policies in the target MDP, a disentangled representation ψ(s, a) over the states and actions is learned with neural networks and generalized across multiple tasks. Next, a task (reward) mapper w_i is learned, based on which the Q-function can be derived:

Q_i^π(s, a) = ψ(s, a)^⊤ w_i.

[95] proved that the loss of GPI is bounded by the difference between the source and the target tasks. In addition to policy reuse, their approach involves learning a shared representation ψ(s, a), which is also a form of transferred knowledge and will be elaborated further in Section 4.5.2.
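Combining the theorem with the decomposition Q_i^π(s, a) = ψ(s, a)^⊤ w_i, action selection in the target task can be sketched as below; psi_list stands for the (learned) successor-feature functions of the source policies and w_target for the target task's reward mapper, both of which are assumed to be available.

```python
import numpy as np

def gpi_action(state, actions, psi_list, w_target):
    """Generalized Policy Improvement with successor features.

    psi_list[i](state, action) returns the successor features of source policy i,
    so Q_i(s, a) = psi_i(s, a) . w_target. GPI acts greedily with respect to the
    best source policy at the current state."""
    q = np.array([[np.dot(psi(state, a), w_target) for psi in psi_list]
                  for a in actions])              # shape: (n_actions, n_policies)
    return actions[int(np.argmax(q.max(axis=1)))]
```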

We summarize the above-mentioned policy transfer approaches in Table 3. In general, policy transfer can be realized by knowledge distillation, which can be optimized either from the student's perspective (student distillation) or from the teacher's perspective (teacher distillation). Alternatively, teacher policies can be directly reused to update the target policy. All approaches discussed so far presume one or multiple expert policies that are always at the disposal of the learning agent. Questions such as how to leverage imperfect policies for knowledge transfer, or how to query teacher policies within a budget, remain open for future research along this line.

| Citation | Transfer Approach | MDP Difference | RL framework | Metrics |
| [88] | Distillation | S, A | DQN | ap |
| [89] | Distillation | S, A | DQN | ap, ps |
| [90] | Distillation | S, A | Soft Q-learning | ap, ar, ps |
| [92] | Distillation | S, A | A3C | ap, pe, tt |
| [94] | Reuse | R | Tabular Q-learning | ap |
| [95] | Reuse | R | DQN | ap, ar |

TABLE 3: A comparison of policy transfer approaches.

4.4 Inter-Task Mapping

In this section, we review TL approaches that utilize mapping functions between the source and the target domains to assist knowledge transfer. Research in this domain can be analyzed from two perspectives: (1) what the mapping function is defined over, and (2) how the mapped representation is utilized. Most work discussed in this section shares a common assumption as below:

Assumption. [Existence of Domain Mapping] A one-to-one mapping exists between the source domain M_s = (μ_0^s, S^s, A^s, T^s, γ^s, R^s, S_0^s) and the target domain M_t = (μ_0^t, S^t, A^t, T^t, γ^t, R^t, S_0^t).

Earlier work along this line requires a given mapping function [58, 97]. One example is [58], which assumes that each target state (action) has a unique correspondence in the source domain, and that two mapping functions X_S and X_A are provided over the state space and the action space, respectively, so that X_S(S^t) → S^s and X_A(A^t) → A^s. Based on X_S and X_A, a mapping function over the Q-values, M(Q^s) → Q^t, can be derived accordingly. Another work is [97], which transfers advice as the knowledge between two domains. In their setting, the advice comes from a human expert who provides the mapping function over the Q-values in the source domain and transfers it to the learning policy for the target domain. This advice encourages the learning agent to prefer certain good actions over others, which equivalently provides a relative ranking of actions in the new task.

Later research tackles the inter-task mapping problem by automatically learning a mapping function [98, 99, 100]. Most work learns a mapping function over the state space or a subset of the state space. In this line of work, state representations are usually divided into agent-specific and task-specific representations, denoted as s_agent and s_env, respectively. In [98] and [99], the mapping function is learned on the agent-specific sub-state, and the mapped representation is applied to reshape the immediate reward.



For [98], the invariant feature space mapped from s_agent can be applied across agents that have distinct action spaces but share some morphological similarity. Specifically, they assume that both agents have been trained on the same proxy task, based on which the mapping function is learned. The mapping function is learned using an encoder-decoder neural network structure [101] in order to preserve as much information about the source domain as possible. While transferring knowledge from the source agent to the target agent on a new task, the environment reward is augmented with a shaped reward term that encourages the target agent to imitate the source agent in the embedded feature space:

r′(s, ·) = α ‖ f(s_agent^s; θ_f) − g(s_agent^t; θ_g) ‖,

where f(s_agent^s; θ_f) embeds the agent-specific state in the source domain and g(s_agent^t; θ_g) does the same for the target domain.
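The shaping term above can be computed as in the following sketch, where f and g stand for the learned source and target encoders; the subtraction treats the embedding distance as a penalty so that imitation of the source agent is rewarded, which matches the stated intent of the shaping term.

```python
import numpy as np

def shaped_reward(env_reward, s_agent_source, s_agent_target, f, g, alpha=0.1):
    """Augment the environment reward with a term that keeps the target agent's
    embedded state g(s_t) close to the source agent's embedding f(s_s)."""
    deviation = np.linalg.norm(f(s_agent_source) - g(s_agent_target))
    return env_reward - alpha * deviation
```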

[100] applied the Unsupervised Manifold Alignment (UMA) approach [102] to automatically learn the state mapping between tasks. In their approach, trajectories are collected from both the source and the target domain to learn a mapping between states. When applying policy-gradient learning, trajectories from M_t are first mapped back to the source: ξ^t → ξ^s; then an expert policy in the source domain is applied to each initial state of those trajectories to generate near-optimal trajectories ξ̃^s, which are further mapped to the target domain: ξ̃^s → ξ̃^t. The deviation between ξ̃^t and ξ^t is used as a loss to be minimized in order to improve the target policy. Similar ideas of using UMA to assist transfer by inter-task mapping can also be found in [103] and [104].

In addition to approaches that utilize mappings over states or actions, [105] proposed to learn an inter-task mapping over the transition dynamics space S × A × S. Their work assumes that the source and target domains differ in the dimensionality of the transition space. Transition triplets from both the source domain ⟨s^s, a^s, s′^s⟩ and the target domain ⟨s^t, a^t, s′^t⟩ are mapped to a latent space Z. Given the feature representations in Z, which has a higher dimensionality, a similarity measure can be applied to find correspondences between source and target task triplets. Triplet pairs with the highest similarity in this feature space Z are used to learn a mapping function X: ⟨s^t, a^t, s′^t⟩ = X(⟨s^s, a^s, s′^s⟩). After the transition mapping, states sampled from the expert policy in the source domain can be leveraged to render beneficial states in the target domain, which assists the target agent in starting its learning with better initial performance. A similar idea of mapping transition dynamics can be found in [106], which, however, requires a stronger assumption on the similarity of the transition probabilities and the state representations between the source and the target domains.
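A simplified sketch of the triplet-correspondence step follows: given latent embeddings of source and target transition triplets, each target triplet is paired with its most similar source triplet, and the resulting pairs are then used to fit the mapping X. Cosine similarity is used here only as one plausible similarity measure.

```python
import numpy as np

def triplet_correspondences(z_source, z_target):
    """Pair each target triplet embedding (row of z_target) with the index of
    its most similar source triplet embedding (row of z_source)."""
    a = z_source / np.linalg.norm(z_source, axis=1, keepdims=True)
    b = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    similarity = b @ a.T              # (n_target, n_source) cosine similarities
    return np.argmax(similarity, axis=1)
```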

As summarized in Table 4, for TL approaches that utilize an inter-task mapping, the mapped knowledge can be (a subset of) the state space [98, 99], the Q-function [58], or (representations of) the state-action-state transitions [105]. In addition to being directly applicable in the target domain [105], the mapped representation can also be used as an augmented shaping reward [98, 99] or a loss objective [100] to guide the agent's learning in the target domain.

| Citation | Algorithm | MDP Difference | Mapping Function | Usage of Mapping |
| [58] | SARSA | S^s ≠ S^t, A^s ≠ A^t | M(Q^s) → Q^t | Q-value reuse |
| [97] | Q-learning | A^s ≠ A^t, R^s ≠ R^t | M(Q^s) → advice | Relative Q ranking |
| [98] | Generally applicable | S^s ≠ S^t | M(s^t) → r′ | Reward shaping |
| [99] | SARSA(λ) | S^s ≠ S^t, R^s ≠ R^t | M(s^t) → r′ | Reward shaping |
| [100] | Fitted Value Iteration | S^s ≠ S^t | M(s^s) → s^t | Penalty loss on state deviation from expert policy |
| [106] | Fitted Q Iteration | S^s × A^s ≠ S^t × A^t | M((s^s, a^s, s′^s) → (s^t, a^t, s′^t)) | Reduce random exploration |
| [105] | No constraint | S^s × A^s ≠ S^t × A^t | M((s^s, a^s, s′^s) → (s^t, a^t, s′^t)) | Reduce random exploration |

TABLE 4: A comparison of inter-task mapping approaches.

4.5 Representation Transfer

In this section, we review TL approaches in which the transferred knowledge consists of feature representations, such as representations learned for the value function or Q-function. Approaches discussed in this section build on the powerful approximation ability of deep neural networks and share the following assumption:

Assumption. [Existence of a Task-Invariant Subspace] The state space (S), action space (A), or even reward space (R) can be disentangled into orthogonal sub-spaces, some of which are task-invariant and shared by both the source and target domains, such that knowledge can be transferred between domains on the universal sub-space.

We organize recent work along this line into two subtopics: (i) approaches that directly reuse representations from the source domain (Section 4.5.1), and (ii) approaches that learn to disentangle the source domain representations into independent sub-feature representations, some of which lie in a universal feature space shared by both the source and the target domains (Section 4.5.2).

4.5.1 Reusing Representations

A representative work on reusing representations is [107], which proposed the progressive neural network structure to enable knowledge transfer across multiple RL tasks in a progressive way. A progressive network is composed of multiple columns, where each column is a policy network for training one specific task. It starts with a single column for training the first task, and the number of columns increases with the number of new tasks. While training on a new task, neuron weights of the previous columns are frozen, and representations from those frozen columns are fed to the new column via lateral connections to assist in learning the new task. This process can be generalized mathematically as follows:

h_i^{(k)} = f( W_i^{(k)} h_{i−1}^{(k)} + Σ_{j<k} U_i^{(k:j)} h_{i−1}^{(j)} ),

where h_i^{(k)} is the i-th hidden layer for task (column) k, W_i^{(k)} is the associated weight matrix, and U_i^{(k:j)} are the lateral connections from layer i−1 of previous tasks to the current layer of task k.
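The layer-wise computation above can be written as a small PyTorch module, sketched below; the activation function, layer sizes, and the use of linear lateral adapters are illustrative simplifications of the progressive-network architecture.

```python
import torch
import torch.nn as nn

class ProgressiveLayer(nn.Module):
    """One hidden layer of column k: a within-column weight W_i^(k) plus
    lateral connections U_i^(k:j) from the frozen previous columns."""
    def __init__(self, in_dim, out_dim, n_prev_columns):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim)
        self.laterals = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(n_prev_columns)]
        )

    def forward(self, h_own, h_prev_list):
        # h_own: h_{i-1}^(k); h_prev_list: [h_{i-1}^(j) for j < k], kept frozen.
        out = self.w(h_own)
        for lateral, h_prev in zip(self.laterals, h_prev_list):
            out = out + lateral(h_prev)
        return torch.relu(out)
```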




Although the progressive network is an effective multi-task approach, it comes at the cost of a giant network structure, as the network grows proportionally with the number of incoming tasks. A later framework called PathNet, proposed by [108], alleviates this issue by using a network of fixed size. PathNet contains pathways, which are subsets of neurons whose weights contain the knowledge of previous tasks and are frozen during training on new tasks. The population of pathways is evolved using a tournament selection genetic algorithm [109].

Another approach to reusing representations for TL is modular networks [110, 111, 112]. For example, [110] proposed to decompose the policy network into a task-specific module and an agent-specific module. Specifically, letting π be a policy performed by any agent (robot) r on task M_k, expressed as a function φ over states s, it can be decomposed into two sub-modules g_k and f_r:

π(s) := φ(s_env, s_agent) = f_r(g_k(s_env), s_agent),

where f_r is the agent-specific module and g_k is the task-specific module. Their central idea is that the task-specific module can be applied to different agents performing the same task, serving as the transferred knowledge. Accordingly, the agent-specific module can be applied to different tasks for the same agent.
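A minimal sketch of this decomposition is given below: the task-specific module g_k and the agent-specific module f_r are arbitrary networks, and transfer amounts to re-pairing a trained module with a new counterpart. Concatenating the task embedding with the agent-specific state is one simple way to realize f_r(g_k(s_env), s_agent).

```python
import torch
import torch.nn as nn

class ModularPolicy(nn.Module):
    """pi(s) = f_r(g_k(s_env), s_agent): a task-specific module shared across
    agents and an agent-specific module shared across tasks."""
    def __init__(self, task_module: nn.Module, agent_module: nn.Module):
        super().__init__()
        self.g_k = task_module    # consumes the task-specific part of the state
        self.f_r = agent_module   # consumes (task embedding, agent-specific state)

    def forward(self, s_env, s_agent):
        task_embedding = self.g_k(s_env)
        return self.f_r(torch.cat([task_embedding, s_agent], dim=-1))

# Transfer sketch: reuse a trained task module with a new robot's agent module,
# e.g. ModularPolicy(task_module=g_k_pretrained, agent_module=f_new_robot).
```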

A model-based approach along this line is [112], which learns a model to map the state observation s to a latent representation z. Accordingly, the transition probability is modeled on the latent space instead of the original state space, i.e. z_{t+1} = f_θ(z_t, a_t), where θ is the parameter of the transition model, z_t is the latent representation of the state observation, and a_t is the accompanying action. Next, a reward module learns the value function as well as the policy from the latent space z using an actor-critic framework. One potential benefit of this latent representation is that knowledge can be transferred across tasks that have different rewards but share the same transition dynamics, in which case the dynamics module can be directly applied to the target domain.

4.5.2 Disentangling Representations

Methods discussed in this section mostly focus on learning a disentangled representation. Specifically, we elaborate on TL approaches derived from two techniques: Successor Representations (SR) and Universal Value Function Approximators (UVFA).

Successor Representations (SR) are an approach to decouple the state features of a domain from its reward distribution. They enable knowledge transfer across multiple domains M = {M_1, M_2, ..., M_K}, so long as the only difference among them is their reward distributions: R_i ≠ R_j. SR originated from neuroscience, and [113] later proposed to leverage it as a generalization mechanism for state representations in the RL domain.

Different from the V-value or Q-value, which describe states in a way that depends on the reward distribution of the MDP, SR features a state based on the occupancy measure of its successor states. More concretely, the occupancy measure is the unnormalized distribution of states or state-action pairs that an agent encounters when following policy π in the MDP [30]. Specifically, SR decomposes the value function of any policy into two independent components, ψ and R:

V^π(s) = Σ_{s′} ψ(s, s′) w(s′),

where w(s′) is a reward mapping function that maps states to scalar rewards, and ψ is the SR, which describes any state s by the occupancy measure of the states visited in the future when following π:

ψ(s, s′) = E_π [ Σ_{i=t}^{∞} γ^{i−t} 1[S_i = s′] | S_t = s ],

with 1[S_i = s′] as an indicator function.

The successor nature of SR makes it learnable using any TD-learning algorithm. In particular, [113] proved the feasibility of learning such a representation in the tabular case, in which the state transitions can be described using a matrix. SR was later extended by [95] in three respects: (i) the feature domain of SR is extended from states to state-action pairs; (ii) deep neural networks are used as function approximators to represent the SR ψ^π(s, a) and the reward mapper w; and (iii) the Generalized Policy Improvement (GPI) algorithm is introduced to accelerate policy transfer for multiple tasks facilitated by the SR framework (see Section 4.3.2 for more details on GPI). These extensions, however, are built upon a stronger assumption about the MDP:

Assumption. [Linearity of Reward Distributions] The reward functions of all tasks can be computed as linear combinations of a fixed set of features:

r(s, a, s′) = φ(s, a, s′)^⊤ w,    (4)

where φ(s, a, s′) ∈ R^d denotes the latent representation of the state transition, and w ∈ R^d is the task-specific reward mapper.


Based on this assumption, SR can be decoupled from the rewards when evaluating the Q-function of any policy π in a task M_i with reward function R_i:

Q_i^π(s, a) = E_π[ r_{t+1}^i + γ r_{t+2}^i + ... | S_t = s, A_t = a ]
            = E_π[ φ_{t+1}^⊤ w_i + γ φ_{t+2}^⊤ w_i + ... | S_t = s, A_t = a ]
            = ψ^π(s, a)^⊤ w_i.    (5)

The advantage of SR is that, once ψ^π(s, a) is known in M_s, one can quickly obtain a performance evaluation of the same policy in M_t by replacing w_s with w_t: Q_{M_t}^π = ψ^π(s, a)^⊤ w_t.
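For the tabular case mentioned above, ψ can be learned with a simple TD update and the resulting representation re-used across reward functions; the sketch below follows the definition of ψ(s, s′), with the learning rate and discount factor as illustrative constants.

```python
import numpy as np

def sr_td_update(psi, s, s_next, gamma=0.99, alpha=0.1):
    """One TD update of a tabular successor representation psi (n_states x n_states).

    The target is the one-hot occupancy of the current state s plus the
    discounted SR of the successor state s_next."""
    target = np.eye(psi.shape[0])[s] + gamma * psi[s_next]
    psi[s] += alpha * (target - psi[s])
    return psi

# Once psi is learned for a policy, V(s) = psi[s] @ w re-evaluates that policy
# under any reward mapper w, so swapping w_s for w_t transfers the evaluation
# to a new task without relearning psi.
```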

Similar ideas of learning SR with a TD algorithm on a latent representation φ(s, a, s′) can also be found in [114, 115]. Specifically, the work of [114] was developed under an assumption weaker than Equation (4): instead of requiring linearly-decoupled rewards, the latent space φ(s, a, s′) is learned with an encoder-decoder structure to ensure that the information loss is minimized when mapping states to the latent space. This structure, therefore, comes with the extra cost of learning a decoder f_d to reconstruct the state: f_d(φ(s_t)) ≈ s_t.

An intriguing question faced by the SR approach is: is there a way to evade the linearity assumption on reward functions while still enabling the SR to be learned without extra modular cost? An extension of SR [116] answered this question affirmatively, proving that the reward functions do not necessarily have to follow the linear structure, albeit at the cost of a looser performance lower-bound when applying the GPI approach for policy improvement. In particular, rather than learning a reward-agnostic latent feature φ(s, a, s′) ∈ R^d for multiple tasks, [116] aims to learn a matrix φ(s, a, s′) ∈ R^{D×d} that interprets the basis functions of the latent space, where D is the number of seen tasks. Assuming k out of the D tasks are linearly independent, this matrix forms k basis functions for the latent space. Therefore, for any unseen task M_i, its latent features can be built as a linear combination of these basis functions, and so can its reward function r_i(s, a, s′). Based on this idea of learning basis functions for a task's latent space, they proposed that learning φ(s, a, s′) can be approximated by learning r(s, a, s′) directly, where r(s, a, s′) ∈ R^D is a vector of the reward functions of each seen task:

r(s, a, s′) = [ r_1(s, a, s′); r_2(s, a, s′); ...; r_D(s, a, s′) ].

Accordingly, learning ψ(s, a) for any policy π_i in M_i becomes equivalent to learning a collection of Q-functions:

ψ̃^{π_i}(s, a) = [ Q_1^{π_i}(s, a), Q_2^{π_i}(s, a), ..., Q_D^{π_i}(s, a) ].

A similar idea of using reward functions as features to represent unseen tasks was also proposed by [117], which, however, assumes that ψ and w are quantities observable from the environment.

Universal Value Function Approximators (UVFA) are an alternative approach to learning disentangled state representations [54]. Like SR, UVFA allows transfer learning across multiple tasks that differ only in their reward functions (goals). Different from SR, which focuses on learning a reward-agnostic state representation, UVFA aims to find a function approximator that generalizes over both states and goals. The UVFA framework is built on a specific problem setting:

Problem Setting. (Goal-Conditional RL) Task goals are defined in terms of states, e.g. given the state space S and the goal space G, it holds that G ⊆ S.

One instantiation of this problem setting is an agent exploring different locations in a maze, where the goals are described as certain locations inside the maze. Under this problem setting, a UVFA module can be decoupled into a state embedding φ(s) and a goal embedding ψ(g) by applying matrix factorization [118] to a reward matrix describing the goal-conditional task.

One merit of UVFA resides in its transferrable embedding φ(s) across tasks that only differ by goals. Another is its ability to continue learning when the set of goals keeps expanding over time. On the other hand, a key challenge of UVFA is that applying matrix factorization is time-consuming, which becomes a practical concern when factorizing matrices from complex environments with a large state space |S|. Even with the learned embedding networks, a third stage of fine-tuning these networks via end-to-end training is still necessary. The authors refer to the OptSpace tool for matrix factorization [119].
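The two-stage factorization can be illustrated with a small sketch; a plain truncated SVD is used here as a stand-in for the OptSpace factorization (which also handles partially observed matrices), so this is only a simplified picture of the procedure.

```python
import numpy as np

def factorize_goal_value_matrix(V, d):
    """Factor an (n_states x n_goals) value matrix into rank-d state embeddings
    phi(s) and goal embeddings psi(g), so that V is approximately phi @ psi.T."""
    u, sigma, vt = np.linalg.svd(V, full_matrices=False)
    phi = u[:, :d] * np.sqrt(sigma[:d])        # one row per state
    psi = vt[:d, :].T * np.sqrt(sigma[:d])     # one row per goal
    return phi, psi

# In the full UVFA pipeline, embedding networks are then trained to regress
# phi(s) and psi(g) from raw inputs and fine-tuned end-to-end.
```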

UVFA has been connected to SR by [116], in which a set of independent rewards (tasks) themselves can be used as features for state representations. Another extension that combines UVFA with SR is the Universal Successor Feature Approximator (USFA), proposed by [120]. Following the same linearity assumption on rewards as in Equation (4), USFA is proposed as a function over a triplet of state, action, and a policy embedding z:

φ(s, a, z) : S × A × R^k → R^d,

where z is the output of a policy-encoding mapping z = e(π) : S × A → R^k. Based on USFA, the Q-function of any policy π for a task specified by w can be formulated as the product of a reward-agnostic Universal Successor Feature (USF) ψ and a reward mapper w:

Q(s, a, w, z) = ψ(s, a, z)^⊤ w.

The above Q-function representation is distinct from Equation (5), as ψ(s, a, z) is generalized over multiple policies, each denoted by z. Facilitated by the disentangled rewards and policy generalization, [120] further introduced a generalized TD-error as a function of the task w and the policy z, which allows approximating the Q-function of any policy on any task using a TD algorithm.
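The USFA evaluation and its use inside GPI can be sketched as follows; psi_net is an illustrative callable standing in for the learned universal successor feature network, and zs is a set of candidate policy embeddings.

```python
import numpy as np

def usfa_q(psi_net, s, a, z, w):
    """Q(s, a, w, z) = psi(s, a, z)^T w for a policy embedding z and task w."""
    return float(np.dot(psi_net(s, a, z), w))

def gpi_over_embeddings(psi_net, s, actions, zs, w):
    # GPI across candidate policy embeddings zs, for the task specified by w.
    q = np.array([[usfa_q(psi_net, s, a, z, w) for z in zs] for a in actions])
    return actions[int(np.argmax(q.max(axis=1)))]
```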

4.5.3 Discussion

We provide a summary of the work discussed in this section in Table 5. In general, representation transfer can facilitate transfer learning in many ways, and work along this line usually shares certain assumptions about some task-invariant property. Most of it assumes that tasks differ only in their reward distributions while sharing the same state (or action, or transition) probabilities. Other, stronger assumptions include (i) dynamics, rewards [95], or policies [120] that can be decoupled from the Q-function representation, and (ii) the feasibility of defining tasks in terms of states [120].


Based on those assumptions, approaches such as TD algorithms [116] or matrix factorization [54] become applicable for learning such disentangled representations. To further exploit the effectiveness of the disentangled structure, we think that generalization approaches which allow changing dynamics or state distributions are important future work that deserves more attention in this domain.

| Citation | Representation Format | Assumptions | MDP Difference | Learner | Metrics |
| [107] | Lateral connections to previously learned network modules | N/A | S, A | A3C | ap, ps |
| [108] | Selected neural paths | N/A | S, A | A3C | ap |
| [110] | Task (agent)-specific network module | Disentangled state representation | S, A | Policy Gradient | ap |
| [112] | Dynamic transitions module learned on latent representations of the state space | N/A | S, A | A3C | ap, pe |
| [95] | SF | Reward function can be linearly decoupled | R | DQN | ap, ar |
| [114] | Encoder-decoder learned SF | N/A | R | DQN | pe, ps |
| [116] | Encoder-decoder learned SF | Rewards can be represented by a set of basis functions | R | Q(λ) | ap, pe |
| [54] | Matrix-factorized UF | Goals are defined in terms of states | R | Tabular Q-learning | ap, pe, ps |
| [120] | Policy-encoded UF | Reward function can be linearly decoupled | R | ε-greedy Q-learning | ap, pe |

TABLE 5: A comparison of TL approaches that transfer representations.

There remain open questions along the line of representation transfer. One is how to handle drastic changes in reward functions between domains. As discussed in [121], good policies in one MDP may perform poorly in another, because states or actions that are beneficial in M_s may become detrimental in M_t under a totally different reward function. In particular, as discussed in the GPI work [95], the performance lower-bound when transferring knowledge across tasks is determined by the discrepancy between their reward functions. Learning a set of basis functions [116] to represent unseen tasks (reward functions), or decoupling policies from the Q-function representation [120], may serve as a good start for addressing this issue, as they propose a generalized latent space from which different tasks (reward functions) can be interpreted. However, the limitation of this work is that it is not clear how many, and what kind of, sub-tasks need to be learned to make the latent space generalizable enough to interpret unseen tasks.

Another question is how to generalize the representation framework to allow transfer learning across domains with different dynamics (or state-action spaces). A learned SR might not be transferrable to an MDP with different transition dynamics, as the distribution of the occupancy measure over successor states no longer holds even when following the same policy. Potential solutions may include model-based approaches that approximate the dynamics directly, or training a latent representation space for states using multiple tasks with different dynamics for better generalization [122]. Alternatively, TL mechanisms from the supervised learning domain, such as meta-learning, which enables fast adaptation to new tasks [40], or importance sampling [123], which can compensate for changes in the prior distribution [12], might also shed light on this question.

5 APPLICATIONS

In this section, we summarize practical applications of TL techniques in the RL domain:

Robotics learning is an important topic in the RL domain. [124] provided a comprehensive summary of applying RL techniques to robotics learning. Under the RL framework, a classical TL approach for facilitating robotics learning is robot learning from demonstrations, where expert demonstrations from humans or other robots are leveraged to teach the learning robot. [125] provided a nice summarization of approaches on this topic.

Later, a scheme of collaborative robotic training emerged [126]. In collaborative training, knowledge is transferred between different robots by sharing their policies and episodic demonstrations with each other. A recent instantiation of this approach can be found in [127]. Their approach can be considered a policy transfer across multiple robot agents under the DQN framework, which shares demonstrations in a common pool and performs policy updates asynchronously.

Recent work on reinforcement robotics learning with TL approaches places more emphasis on the ability to adapt quickly and robustly to unseen tasks. A typical approach to achieve this property is to design and select multiple source domains for robust training, so that a generalized policy trained on those source tasks can be quickly transferred to target domains. Examples include the EPOpt approach proposed by [128], which combines policy transfer via a source-domain ensemble with learning from limited demonstrations for fast adaptation to the target task. Another application can be found in [129], in which robust agent policies are trained with a large number of synthetic demonstrations from a simulator to handle dynamic environments. Another idea for fast adaptation is to learn latent representations from observations in the source domain that are generally applicable to the target domain, e.g. training robots using simulated 2D image inputs and deploying the robot in real 3D environments. Work along this line includes [130], which learns the latent representation using 3D CAD models, and [131, 132], which are derived from Generative Adversarial Networks. Another example is DARLA [133], a zero-shot transfer approach that learns disentangled representations that are robust against domain shifts.

Game playing is one of the most representative testbeds for TL and RL algorithms. Both the complexity and the diversity of games used for evaluating TL approaches have evolved over recent decades, from classical testbeds such as grid-world games to more complex settings such as online strategy games or video games with raw-pixel RGB inputs. A representative TL application in game playing is AlphaGo, an algorithm that learns the board game Go using both TL and RL techniques [2]. AlphaGo is first trained offline using expert demonstrations and then learns to optimize its policy using Monte-Carlo Tree Search. Its successor, AlphaGo Master [3], even beat the world's No.1-ranked human player.

In addition to board games, TL approaches have also performed well in video game playing. State-of-the-art video game platforms include MineCraft, Atari, and StarCraft. In particular, [134] designed new RL tasks on the MineCraft platform for a better comparison of different RL algorithms. We refer readers to [135] for a survey of AI for real-time strategy (RTS) games on the StarCraft platform, with a dataset available from [136]. Moreover, [137] provided a comprehensive survey on DL applications in video game playing, which also covers TL and RL strategies from certain perspectives. A large portion of the TL approaches reviewed in this survey have been applied to Atari [138] and the other above-mentioned game platforms. Notably, OpenAI trained a Dota2 agent that can surpass human experts [139]. We summarize the game applications of TL approaches mentioned in this survey in Table 6.

| Game | Citation | TL Approach |
| Atari, 3D Maze | [58] | Transferrable representation |
| Atari, MuJoCo | [81] | LFD |
| Atari | [77] | LFD |
| Go | [2] | LFD |
| Keepaway | [58] | Mapping |
| RoboCup | [97] | Mapping |
| Atari | [88] | Policy Transfer |
| Atari | [89] | Policy Transfer |
| Atari | [90] | Policy Transfer |
| 3D Maze | [92] | Policy Transfer |
| Dota2 | [139] | Reward Shaping |

TABLE 6: TL applications in game playing.

Natural Language Processing (NLP) research has evolved rapidly along with the advancement of DL and RL, and there is an increasing trend of addressing NLP problems by leveraging RL techniques. Applications of RL in NLP range widely, from Question Answering (QA) [140], dialogue systems [141], and machine translation [142], to tasks integrating NLP and computer vision, such as Visual Question Answering (VQA) [143] and image captioning [144].



Many of these NLP applications have implicitly applied TL approaches, including learning from demonstrations, policy transfer, or reward shaping, in order to better tailor these RL techniques to NLP problems that were previously dominated by supervised-learning counterparts [145]. Examples in this field include applying expert demonstrations to build RL solutions for spoken dialogue systems [146] and VQA [143]; building shaped rewards for sequence generation [147], spoken dialogue systems [71], QA [72, 148], and image captioning [144]; and transferring policies for structured prediction [149] and VQA [150]. We summarize the above-mentioned applications in Table 7.

| Citation | Application | TL Approach |
| [146] | Spoken Dialogue System | LFD |
| [143] | VQA | LFD |
| [147] | Sequence Generation | LFD, Reward Shaping |
| [72] | QA | Reward Shaping |
| [148] | QA | LFD, Reward Shaping |
| [149] | Structured Prediction | Policy Transfer |
| [150] | Grounded Dialog Generation | Policy Distillation |
| [144] | Image Caption | Reward Shaping |

TABLE 7: Applications of TL approaches in Natural Language Processing.

Health informatics is another domain that has benefited from the advancement of RL. RL techniques have been applied to many healthcare tasks, including dynamic treatment regimes [151, 152], automatic medical diagnosis [153, 154], health resource scheduling [155, 156], and drug discovery and development [157, 158]. An overview of recent achievements of RL techniques in health informatics is provided by [159]. Despite the emergence of RL applications for healthcare problems, only a limited number of them have utilized TL approaches, although we do observe applications that leverage prior knowledge to improve the RL procedure. Specifically, [160] utilized Q-learning for drug delivery individualization; they integrated prior knowledge of the dose-response characteristics into their Q-learning framework and leveraged it to avoid unnecessary exploration. Some work has considered reusing representations to speed up the decision-making process [161, 162]. For example, [161] proposed to exploit both individual variability and the common policy model structure for individualized HIV treatment. [162] applied a DQN framework for prescribing effective HIV treatments, in which a latent representation is learned to estimate the uncertainty when transferring a pretrained policy to unseen domains. [163] considered the possibility of applying human-involved interactive RL training in health informatics. We consider the integration of TL with RL a promising direction in health informatics, as it can further improve learning effectiveness and sample efficiency, especially given the difficulty of accessing large amounts of clinical data.

Others: Besides the above-mentioned topics, RL has been utilized in many other real-life applications. For example, applications in transportation systems have adopted RL to address traffic congestion with better traffic signal scheduling and transportation resource allocation [10, 11, 164, 165]. We refer readers to [166] for a review of RL applications in traffic signal control. RL and deep RL are also effective solutions to problems in finance, including portfolio management [167, 168], asset allocation [169], and trading optimization [170]. Another application area is electricity systems, especially intelligent electricity networks, which can benefit from RL techniques for improved power-delivery decisions [171, 172] and active resource management [173]. We refer readers to [9] for a comprehensive survey of RL techniques for electric power system applications. We believe that, given the plethora of theoretical and algorithmic breakthroughs in TL and DRL research, a hybrid framework of TL and DRL techniques holds promise for further accelerating the development of these domains.

6 FUTURE PERSPECTIVES

In this section, we propose some future research directions for TL in the RL domain:

Modeling Transferability: A key question for TL is whether, or to what extent, the knowledge for solving one task can help in solving another. Answering this question contributes to realizing automatic TL, including source task selection, mapping function design, disentangling representations, avoiding negative transfer, etc. By aggregating the problem settings as well as the assumptions of the TL approaches discussed in this survey, we introduce a framework that theoretically enables transfer learning across different domains. Specifically, for a set of domains M = {M_i}_{i=1}^{K} with K ≥ 2, where each M_i ∈ M is given by M_i = (S_i, A_i, T_i, R_i, ...), it is feasible to transfer knowledge across these domains if there exist:




• A set of invertible mapping functions {g_i}_{i=1}^{K}, each of which maps the state representation of a certain domain to a consensual latent space Z_S, i.e.: ∀ i ∈ {K}, ∃ g_i : S_i → Z_S, and g_i^{-1} : Z_S → S_i.
• A set of invertible mapping functions {f_i}_{i=1}^{K}, each of which maps the action representation of a certain domain to a consensual latent space Z_A, i.e.: ∀ i ∈ {K}, ∃ f_i : A_i → Z_A, and f_i^{-1} : Z_A → A_i.
• A policy π : Z_S → Z_A and a constant ε ≥ 0, such that π is ε-optimal on the common feature space for any considered domain M_i ∈ M, i.e.:

∃ π : Z_S → Z_A, ε ≥ 0, s.t. ∀ M_i ∈ M, | V^π_{M_i} − V^*_{M_i} | ≤ ε,

where V^*_{M_i} denotes the value of the optimal policy for domain M_i.

Evaluating Transferability: Evaluation metrics have been proposed to assess TL approaches from different but complementary perspectives, although no single metric can summarize the efficacy of a TL approach. Designing a set of generalized, novel metrics would benefit the development of TL in the DRL domain. In addition to current benchmarks such as OpenAI Gym, which is designed purely for evaluating RL approaches, a unified benchmark for evaluating TL performance is also worth research and engineering effort.

Framework-agnostic Transfer: Most contemporary TL approaches are designed for certain RL frameworks. For example, some TL methods are applicable to RL algorithms designed for discrete action spaces (such as DQfD), while others may only be feasible given a continuous action space. One fundamental cause of these framework-dependent TL methods is the diversified development of RL algorithms. We expect that a more unified RL framework would in turn contribute to the standardization of TL approaches in this field.

Interpretability: Deep learning and end-to-end systems have made network representations a black box, making it difficult to interpret or debug a model's representations or decisions. As a result, there have been efforts to define and evaluate the interpretability of approaches in the supervised learning domain [174, 175, 176]. The merits of interpretability are manifold, including enabling disentangled representations, building explainable models, and facilitating human-computer interaction. Meanwhile, interpretable TL approaches for the RL domain, especially those with explainable representations or policy decisions, can also benefit many applied fields, such as robotics and finance. Moreover, interpretability can help avoid catastrophic decision-making in tasks such as autonomous driving or healthcare. Although there have been efforts toward explainable TL approaches for RL tasks [177, 178], there is neither a clear definition of interpretable TL in the context of RL nor a framework to evaluate the interpretability of TL approaches. We believe that the standardization of interpretable TL for RL will be a topic worth more research effort in the near future.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of go without human knowledge," Nature, vol. 550, no. 7676, p. 354, 2017.
[4] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A brief survey of deep reinforcement learning," arXiv preprint arXiv:1708.05866, 2017.
[5] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[6] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
[7] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, "The arcade learning environment: An evaluation platform for general agents," Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013.
[8] M. R. Kosorok and E. E. Moodie, Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine. SIAM, 2015, vol. 21.
[9] M. Glavic, R. Fonteneau, and D. Ernst, "Reinforcement learning for electric power system decision and control: Past considerations and perspectives," IFAC-PapersOnLine, vol. 50, no. 1, pp. 6918–6927, 2017.


[10] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): Methodology and large-scale application on downtown Toronto," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140–1150, 2013.
[11] H. Wei, G. Zheng, H. Yao, and Z. Li, "IntelliLight: A reinforcement learning approach for intelligent traffic light control," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 2496–2505.
[12] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[13] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1633–1685, 2009.
[14] A. Lazaric, "Transfer in reinforcement learning: a framework and a survey," in Reinforcement Learning. Springer, 2012, pp. 143–173.
[15] R. Bellman, "A Markovian decision process," Journal of Mathematics and Mechanics, pp. 679–684, 1957.
[16] G. A. Rummery and M. Niranjan, On-line Q-learning Using Connectionist Systems. University of Cambridge, Department of Engineering, Cambridge, England, 1994, vol. 37.
[17] H. Van Seijen, H. Van Hasselt, S. Whiteson, and M. Wiering, "A theoretical and empirical analysis of expected sarsa," in 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning. IEEE, 2009, pp. 177–184.
[18] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in Advances in Neural Information Processing Systems, 2000, pp. 1008–1014.
[19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[20] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
[21] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[23] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, "Rainbow: Combining improvements in deep reinforcement learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[24] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[25] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[27] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," 2014.
[28] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[29] S. Fujimoto, H. Van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," arXiv preprint arXiv:1802.09477, 2018.
[30] J. Ho and S. Ermon, "Generative adversarial imitation learning," in Advances in Neural Information Processing Systems, 2016, pp. 4565–4573.
[31] Z. Zhu, K. Lin, B. Dai, and J. Zhou, "Off-policy imitation learning from observations," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 12402–12413.
[32] I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson, "Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning," arXiv preprint arXiv:1809.02925, 2018.
[33] D. A. Pomerleau, "Efficient training of artificial neural networks for autonomous navigation," Neural Computation, vol. 3, no. 1, pp. 88–97, 1991.
[34] A. Y. Ng, S. J. Russell et al., "Algorithms for inverse reinforcement learning," in ICML, vol. 1, 2000, p. 2.
[35] M. Jing, X. Ma, W. Huang, F. Sun, C. Yang, B. Fang, and H. Liu, "Reinforcement learning from imperfect demonstrations under soft expert guidance," in AAAI, 2020, pp. 5109–5116.
[36] Z. Zhu, K. Lin, B. Dai, and J. Zhou, "Learning sparse rewarded tasks from sub-optimal demonstrations," arXiv preprint arXiv:2004.00530, 2020.
[37] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: A review," Neural Networks, 2019.
[38] R. S. Sutton, A. Koop, and D. Silver, "On the role of tracking in stationary environments," in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 871–878.
[39] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel, "Continuous adaptation via meta-learning in nonstationary and competitive environments," ICLR, 2018.
[40] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1126–1135.
[41] C. H. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 951–958.


[42] P. Dayan and G. E. Hinton, "Feudal reinforcement learning," in Advances in Neural Information Processing Systems, 1993, pp. 271–278.
[43] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
[44] R. Parr and S. J. Russell, "Reinforcement learning with hierarchies of machines," in Advances in Neural Information Processing Systems, 1998, pp. 1043–1049.
[45] T. G. Dietterich, "Hierarchical reinforcement learning with the MAXQ value function decomposition," Journal of Artificial Intelligence Research, vol. 13, pp. 227–303, 2000.
[46] R. B. Myerson, Game Theory. Harvard University Press, 2013.
[47] L. Bu, R. Babu, B. De Schutter et al., "A comprehensive survey of multiagent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.
[48] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 330–337.
[49] F. L. Da Silva and A. H. R. Costa, "A survey on transfer learning for multiagent reinforcement learning systems," Journal of Artificial Intelligence Research, vol. 64, pp. 645–703, 2019.
[50] B. Kim, A.-m. Farahmand, J. Pineau, and D. Precup, "Learning from limited demonstrations," in Advances in Neural Information Processing Systems, 2013, pp. 2859–2867.
[51] W. Czarnecki, R. Pascanu, S. Osindero, S. Jayakumar, G. Swirszcz, and M. Jaderberg, "Distilling policy distillation," in The 22nd International Conference on Artificial Intelligence and Statistics, 2019.
[52] A. Y. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in ICML, vol. 99, 1999, pp. 278–287.
[53] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[54] T. Schaul, D. Horgan, K. Gregor, and D. Silver, "Universal value function approximators," in International Conference on Machine Learning, 2015, pp. 1312–1320.
[55] C. Finn and S. Levine, "Meta-learning: from few-shot learning to rapid reinforcement learning," in ICML, 2019.
[56] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, "Sim-to-real: Learning agile locomotion for quadruped robots," arXiv preprint arXiv:1804.10332, 2018.
[57] Y. Gao, J. Lin, F. Yu, S. Levine, T. Darrell et al., "Reinforcement learning from imperfect demonstrations," arXiv preprint arXiv:1802.05313, 2018.
[58] M. E. Taylor, P. Stone, and Y. Liu, "Transfer learning via inter-task mappings for temporal difference learning," Journal of Machine Learning Research, vol. 8, no. Sep, pp. 2125–2167, 2007.
[59] J. Randløv and P. Alstrøm, "Learning to drive a bicycle using reinforcement learning and shaping," in ICML, vol. 98, 1998, pp. 463–471.
[60] R. J. Williams and L. C. Baird, "Tight performance bounds on greedy policies based on imperfect value functions," Citeseer, Tech. Rep., 1993.
[61] E. Wiewiora, G. W. Cottrell, and C. Elkan, "Principled methods for advising reinforcement learning agents," in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 792–799.
[62] S. M. Devlin and D. Kudenko, "Dynamic potential-based reward shaping," in Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems. IFAAMAS, 2012, pp. 433–440.
[63] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowe, "Expressing arbitrary reward functions as potential-based advice," in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[64] T. Brys, A. Harutyunyan, M. E. Taylor, and A. Nowe, "Policy transfer using reward shaping," in Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2015, pp. 181–188.
[65] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothorl, T. Lampe, and M. Riedmiller, "Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards," arXiv preprint arXiv:1707.08817, 2017.
[66] S. Devlin, L. Yliniemi, D. Kudenko, and K. Tumer, "Potential-based difference rewards for multiagent reinforcement learning," in Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2014, pp. 165–172.
[67] M. Grzes and D. Kudenko, "Learning shaping rewards in model-based reinforcement learning," in Proc. AAMAS 2009 Workshop on Adaptive Learning Agents, vol. 115, 2009.
[68] Y. Gao and F. Toni, "Potential based reward shaping for hierarchical reinforcement learning," in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[69] O. Marom and B. Rosman, "Belief reward shaping in reinforcement learning," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[70] A. C. Tenorio-Gonzalez, E. F. Morales, and L. Villasenor-Pineda, "Dynamic reward shaping: Training a robot by voice," in Advances in Artificial Intelligence – IBERAMIA 2010. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 483–492.
[71] P.-H. Su, D. Vandyke, M. Gasic, N. Mrksic, T.-H. Wen, and S. Young, "Reward shaping with recurrent neural networks for speeding up on-line policy learning in spoken dialogue systems," arXiv preprint arXiv:1508.03391, 2015.
[72] X. V. Lin, R. Socher, and C. Xiong, "Multi-hop knowledge graph reasoning with reward shaping," arXiv preprint arXiv:1808.10568, 2018.
[73] F. Liu, Z. Ling, T. Mu, and H. Su, "State alignment-based imitation learning," arXiv preprint arXiv:1911.10947, 2019.


[74] X. Zhang and H. Ma, “Pretraining deep actor-critic reinforcement learning algorithms with expert demonstrations,” arXiv preprint arXiv:1801.10459, 2018.

[75] S. Schaal, “Learning from demonstration,” in Advances in neural information processing systems, 1997, pp. 1040–1046.

[76] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband et al., “Deep q-learning from demonstrations,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[77] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 6292–6299.

[78] J. Chemali and A. Lazaric, “Direct policy iteration with demonstrations,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[79] B. Piot, M. Geist, and O. Pietquin, “Boosted bellman residual minimization handling expert demonstrations,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 549–564.

[80] T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. Taylor, and A. Nowe, “Reinforcement learning from demonstration through shaping,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[81] B. Kang, Z. Jie, and J. Feng, “Policy optimization with demonstrations,” in International Conference on Machine Learning, 2018, pp. 2474–2483.

[82] D. P. Bertsekas, “Approximate policy iteration: A survey and some new methods,” Journal of Control Theory and Applications, vol. 9, no. 3, pp. 310–335, 2011.

[83] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in ICLR, 2016.

[84] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 627–635.

[85] K. Brantley, W. Sun, and M. Henaff, “Disagreement-regularized imitation learning,” in International Conference on Learning Representations, 2019.

[86] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” Deep Learning and Representation Learning Workshop, NIPS, 2014.

[87] A. Polino, R. Pascanu, and D. Alistarh, “Model compression via distillation and quantization,” arXiv preprint arXiv:1802.05668, 2018.

[88] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, “Policy distillation,” arXiv preprint arXiv:1511.06295, 2015.

[89] H. Yin and S. J. Pan, “Knowledge transfer for deep reinforcement learning with hierarchical experience replay,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[90] E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Actor-mimic: Deep multitask and transfer reinforcement learning,” arXiv preprint arXiv:1511.06342, 2015.

[91] S. Schmitt, J. J. Hudson, A. Zidek, S. Osindero, C. Doersch, W. M. Czarnecki, J. Z. Leibo, H. Kuttler, A. Zisserman, K. Simonyan et al., “Kickstarting deep reinforcement learning,” arXiv preprint arXiv:1803.03835, 2018.

[92] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu, “Distral: Robust multitask reinforcement learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4496–4506.

[93] J. Schulman, X. Chen, and P. Abbeel, “Equivalence between policy gradients and soft q-learning,” arXiv preprint arXiv:1704.06440, 2017.

[94] F. Fernandez and M. Veloso, “Probabilistic policy reuse in a reinforcement learning agent,” in Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems. ACM, 2006, pp. 720–727.

[95] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver, “Successor features for transfer in reinforcement learning,” in Advances in neural information processing systems, 2017, pp. 4055–4065.

[96] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37, 1966.

[97] L. Torrey, T. Walker, J. Shavlik, and R. Maclin, “Using advice to transfer knowledge acquired in one reinforcement learning task to another,” in European Conference on Machine Learning. Springer, 2005, pp. 412–424.

[98] A. Gupta, C. Devin, Y. Liu, P. Abbeel, and S. Levine, “Learning invariant feature spaces to transfer skills with reinforcement learning,” International Conference on Learning Representations (ICLR), 2017.

[99] G. Konidaris and A. Barto, “Autonomous shaping: Knowledge transfer in reinforcement learning,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 489–496.

[100] H. B. Ammar and M. E. Taylor, “Reinforcement learning transfer via common subspaces,” in Proceedings of the 11th International Conference on Adaptive and Learning Agents, ser. ALA’11. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 21–36. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-28499-1_2

[101] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.

[102] C. Wang and S. Mahadevan, “Manifold alignment without correspondence,” in Twenty-First International Joint Conference on Artificial Intelligence, 2009.

[103] B. Bocsi, L. Csato, and J. Peters, “Alignment-based transfer learning for robot models,” in The 2013 International Joint Conference on Neural Networks (IJCNN). IEEE, 2013, pp. 1–7.

[104] H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor, “Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[105] H. B. Ammar, K. Tuyls, M. E. Taylor, K. Driessens, and G. Weiss, “Reinforcement learning transfer via sparse coding,” in Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 2012, pp. 383–390.

[106] A. Lazaric, M. Restelli, and A. Bonarini, “Transfer of samples in batch reinforcement learning,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 544–551.

[107] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.

[108] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, “Pathnet: Evolution channels gradient descent in super neural networks,” arXiv preprint arXiv:1701.08734, 2017.

[109] I. Harvey, “The microbial genetic algorithm,” in European Conference on Artificial Life. Springer, 2009, pp. 126–133.

[110] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine, “Learning modular neural network policies for multi-task and multi-robot transfer,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2169–2176.

[111] J. Andreas, D. Klein, and S. Levine, “Modular multitask reinforcement learning with policy sketches,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 166–175.

[112] A. Zhang, H. Satija, and J. Pineau, “Decoupling dynamics and reward for transfer learning,” arXiv preprint arXiv:1804.10689, 2018.

[113] P. Dayan, “Improving generalization for temporal difference learning: The successor representation,” Neural Computation, vol. 5, no. 4, pp. 613–624, 1993.

[114] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman, “Deep successor reinforcement learning,” arXiv preprint arXiv:1606.02396, 2016.

[115] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, “Deep reinforcement learning with successor features for navigation across similar environments,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 2371–2378.

[116] A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Zidek, and R. Munos, “Transfer in deep reinforcement learning using successor features and generalised policy improvement,” arXiv preprint arXiv:1901.10964, 2019.

[117] N. Mehta, S. Natarajan, P. Tadepalli, and A. Fern, “Transfer in variable-reward hierarchical reinforcement learning,” Machine Learning, vol. 73, no. 3, p. 289, 2008.

[118] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, no. 8, pp. 30–37, 2009.

[119] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from a few entries,” IEEE transactions on information theory, vol. 56, no. 6, pp. 2980–2998, 2010.

[120] D. Borsa, A. Barreto, J. Quan, D. Mankowitz, R. Munos, H. van Hasselt, D. Silver, and T. Schaul, “Universal successor features approximators,” arXiv preprint arXiv:1812.07626, 2018.

[121] L. Lehnert, S. Tellex, and M. L. Littman, “Advantages and limitations of using successor features for transfer in reinforcement learning,” arXiv preprint arXiv:1708.00102, 2017.

[122] J. C. Petangoda, S. Pascual-Diaz, V. Adam, P. Vrancx, and J. Grau-Moya, “Disentangled skill embeddings for reinforcement learning,” arXiv preprint arXiv:1906.09223, 2019.

[123] B. Zadrozny, “Learning and evaluating classifiers under sample selection bias,” in Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 114.

[124] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.

[125] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and autonomous systems, vol. 57, no. 5, pp. 469–483, 2009.

[126] B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, “A survey of research on cloud robotics and automation,” IEEE Transactions on automation science and engineering, vol. 12, no. 2, pp. 398–409, 2015.

[127] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3389–3396.

[128] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine, “Epopt: Learning robust neural network policies using model ensembles,” arXiv preprint arXiv:1610.01283, 2016.

[129] W. Yu, J. Tan, C. K. Liu, and G. Turk, “Preparing for the unknown: Learning a universal policy with online system identification,” arXiv preprint arXiv:1702.02453, 2017.

[130] F. Sadeghi and S. Levine, “Cad2rl: Real single-image flight without a single real image,” arXiv preprint arXiv:1611.04201, 2016.

[131] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 4243–4250.

[132] H. Bharadhwaj, Z. Wang, Y. Bengio, and L. Paull, “A data-efficient framework for training and sim-to-real transfer of navigation policies,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 782–788.

[133] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “Darla: Improving zero-shot transfer in reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1480–1490.

[134] J. Oh, V. Chockalingam, S. Singh, and H. Lee, “Control of memory, active perception, and action in minecraft,” arXiv preprint arXiv:1605.09128, 2016.

[135] S. Ontanon, G. Synnaeve, A. Uriarte, F. Richoux, D. Churchill, and M. Preuss, “A survey of real-time strategy game ai research and competition in starcraft,” IEEE Transactions on Computational Intelligence and AI in games, vol. 5, no. 4, pp. 293–311, 2013.

[136] Z. Lin, J. Gehring, V. Khalidov, and G. Synnaeve, “Stardata: A starcraft ai research dataset,” in Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2017.

[137] N. Justesen, P. Bontrager, J. Togelius, and S. Risi, “Deep learning for video game playing,” IEEE Transactions on Games, 2019.

[138] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.

[139] OpenAI. (2019) Dota 2 blog. [Online]. Available: https://openai.com/blog/openai-five/

[140] H. Chen, X. Liu, D. Yin, and J. Tang, “A survey on dialogue systems: Recent advances and new frontiers,” ACM SIGKDD Explorations Newsletter, vol. 19, no. 2, pp. 25–35, 2017.

[141] S. P. Singh, M. J. Kearns, D. J. Litman, and M. A. Walker, “Reinforcement learning for spoken dialogue systems,” in Advances in Neural Information Processing Systems, 2000, pp. 956–962.

[142] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.

[143] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to reason: End-to-end module networks for visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 804–813.

[144] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep reinforcement learning-based image captioning with embedding reward,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 290–298.

[145] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing,” IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.

[146] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Learning to compose neural networks for question answering,” arXiv preprint arXiv:1601.01705, 2016.

[147] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio, “An actor-critic algorithm for sequence prediction,” arXiv preprint arXiv:1607.07086, 2016.

[148] F. Godin, A. Kumar, and A. Mittal, “Learning when not to answer: a ternary reward structure for reinforcement learning based question answering,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), 2019, pp. 122–129.

[149] K.-W. Chang, A. Krishnamurthy, A. Agarwal, J. Langford, and H. Daume III, “Learning to search better than your teacher,” 2015.

[150] J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra, “Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model,” in Advances in Neural Information Processing Systems, 2017, pp. 314–324.

[151] E. B. Laber, D. J. Lizotte, M. Qian, W. E. Pelham, and S. A. Murphy, “Dynamic treatment regimes: Technical challenges and applications,” Electronic journal of statistics, vol. 8, no. 1, p. 1225, 2014.

[152] M. Tenenbaum, A. Fern, L. Getoor, M. Littman, V. Manasinghka, S. Natarajan, D. Page, J. Shrager, Y. Singer, and P. Tadepalli, “Personalizing cancer therapy via machine learning,” in Workshops of NIPS, 2010.

[153] A. Alansary, O. Oktay, Y. Li, L. Le Folgoc, B. Hou, G. Vaillant, K. Kamnitsas, A. Vlontzos, B. Glocker, B. Kainz et al., “Evaluating reinforcement learning agents for anatomical landmark detection,” Medical image analysis, vol. 53, pp. 156–164, 2019.

[154] K. Ma, J. Wang, V. Singh, B. Tamersoy, Y.-J. Chang, A. Wimmer, and T. Chen, “Multimodal image registration with deep context reinforcement learning,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 240–248.

[155] Z. Huang, W. M. van der Aalst, X. Lu, and H. Duan, “Reinforcement learning based resource allocation in business process management,” Data & Knowledge Engineering, vol. 70, no. 1, pp. 127–145, 2011.

[156] T. S. M. T. Gomes, “Reinforcement learning for primary care e appointment scheduling,” 2017.

[157] A. Serrano, B. Imbernon, H. Perez-Sanchez, J. M. Cecilia, A. Bueno-Crespo, and J. L. Abellan, “Accelerating drugs discovery with deep reinforcement learning: An early approach,” in Proceedings of the 47th International Conference on Parallel Processing Companion. ACM, 2018, p. 6.

[158] M. Popova, O. Isayev, and A. Tropsha, “Deep reinforcement learning for de novo drug design,” Science advances, vol. 4, no. 7, p. eaap7885, 2018.

[159] C. Yu, J. Liu, and S. Nemati, “Reinforcement learning in healthcare: A survey,” arXiv preprint arXiv:1908.08796, 2019.

[160] A. E. Gaweda, M. K. Muezzinoglu, G. R. Aronoff, A. A. Jacobs, J. M. Zurada, and M. E. Brier, “Incorporating prior knowledge into q-learning for drug delivery individualization,” in Fourth International Conference on Machine Learning and Applications (ICMLA’05). IEEE, 2005, pp. 6–pp.

[161] V. N. Marivate, J. Chemali, E. Brunskill, and M. Littman, “Quantifying uncertainty in batch personalized sequential decision making,” in Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[162] T. W. Killian, S. Daulton, G. Konidaris, and F. Doshi-Velez, “Robust and efficient transfer learning with hidden parameter markov decision processes,” in Advances in Neural Information Processing Systems, 2017, pp. 6250–6261.

[163] A. Holzinger, “Interactive machine learning for health informatics: when do we need the human-in-the-loop?” Brain Informatics, vol. 3, no. 2, pp. 119–131, 2016.

[164] L. Li, Y. Lv, and F.-Y. Wang, “Traffic signal timing via deep reinforcement learning,” IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.

[165] K. Lin, R. Zhao, Z. Xu, and J. Zhou, “Efficient large-scale fleet management via multi-agent deep reinforcement learning,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1774–1783.

[166] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, “A survey on reinforcement learning models and algorithms for traffic signal control,” ACM Computing Surveys (CSUR), vol. 50, no. 3, p. 34, 2017.

[167] J. Moody, L. Wu, Y. Liao, and M. Saffell, “Performance functions and reinforcement learning for trading systems and portfolios,” Journal of Forecasting, vol. 17, no. 5-6, pp. 441–470, 1998.

[168] Z. Jiang and J. Liang, “Cryptocurrency portfolio management with deep reinforcement learning,” in 2017 Intelligent Systems Conference (IntelliSys). IEEE, 2017, pp. 905–913.

[169] R. Neuneier, “Enhancing q-learning for optimal asset allocation,” in Advances in neural information processing systems, 1998, pp. 936–942.

[170] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement learning for financial signal representation and trading,” IEEE transactions on neural networks and learning systems, vol. 28, no. 3, pp. 653–664, 2016.

[171] G. Dalal, E. Gilboa, and S. Mannor, “Hierarchical decision making in electricity grid management,” in International Conference on Machine Learning, 2016, pp. 2197–2206.

[172] F. Ruelens, B. J. Claessens, S. Vandael, B. De Schutter, R. Babuska, and R. Belmans, “Residential demand response of thermostatically controlled loads using batch reinforcement learning,” IEEE Transactions on Smart Grid, vol. 8, no. 5, pp. 2149–2159, 2016.

[173] Z. Wen, D. O’Neill, and H. Maei, “Optimal demand response using device-based reinforcement learning,” IEEE Transactions on Smart Grid, vol. 6, no. 5, pp. 2312–2324, 2015.

[174] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, “Explaining explanations: An overview of interpretability of machine learning,” in 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2018, pp. 80–89.

[175] Q.-s. Zhang and S.-C. Zhu, “Visual interpretability for deep learning: a survey,” Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 1, pp. 27–39, 2018.

[176] Y. Dong, H. Su, J. Zhu, and B. Zhang, “Improving interpretability of deep neural networks with semantic information,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4306–4314.

[177] Y. Li, J. Song, and S. Ermon, “Infogail: Interpretable imitation learning from visual demonstrations,” in Advances in Neural Information Processing Systems, 2017, pp. 3812–3822.

[178] R. Ramakrishnan and J. Shah, “Towards interpretable explanations for transfer learning in sequential tasks,” in 2016 AAAI Spring Symposium Series, 2016.