
Deep Reinforcement Learning with Decorrelation

Borislav Mavrin 1 Hengshuai Yao 2 Linglong Kong 1 2

Abstract

Learning an effective representation for high-dimensional data is a challenging problem in reinforcement learning (RL). Deep reinforcement learning (DRL) algorithms such as Deep Q-Networks (DQN) achieve remarkable success in computer games by learning deeply encoded representations with convolutional networks. In this paper, we propose a simple yet very effective method for representation learning with DRL algorithms. Our key insight is that features learned by DRL algorithms are highly correlated, which interferes with learning. By adding a regularized loss that penalizes correlation in latent features (at only a slight computational cost), we decorrelate features represented by deep neural networks incrementally. On 49 Atari games, with the same regularization factor, our decorrelation algorithms achieve a median human-normalized score of 70%, which is 40% better than DQN. In particular, ours performs better than DQN on 39 games, with 4 close ties and only slight losses on 6 games. Empirical results also show that the decorrelation method applies to Quantile Regression DQN (QR-DQN) and significantly boosts its performance. Further experiments on the losing games show that our decorrelation algorithms can win over DQN and QR-DQN with a fine-tuned regularization factor.

1. Introduction

Since the early days of artificial intelligence, learning effective representations of states has been one focus of research, especially because state representation is important for algorithms in practice. For example, reinforcement learning (RL)-based algorithms for computer Go used to rely on millions of binary features constructed from local shapes that recognize certain important geometric patterns on the board.

1 Department of Computing Science, University of Alberta, Edmonton, Canada. 2 Huawei Noah's Ark, Edmonton. Correspondence to: Hengshuai Yao <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Some features were even constructed by the Go community from interviewing top Go professionals (Silver, 2009). In this paper, we study state representation for RL. There have been numerous works on learning a good representation in RL with linear function approximation. Tile coding is a classical binary scheme for encoding states that generalizes in local sub-spaces (Albus, 1975). Parr et al. (2008) constructed representations from sampled Bellman errors to reduce policy evaluation error. Petrik (2007) computed powers of a certain transition matrix of the underlying Markov Decision Process (MDP) to represent states. Konidaris et al. (2011) used Fourier basis functions to construct state representations; and so on.

To the best of our knowledge, there are two approaches to state representation learning in DRL. The first is using auxiliary tasks. For example, UNREAL is an architecture that learns a universal representation by presenting the learning agent with auxiliary tasks (each of which has a pseudo reward) that are relevant to the main task (which has an extrinsic reward) (Jaderberg et al., 2016), with the goal of solving all the tasks from a shared representation. The second is a two-phase approach, which first learns a good representation and then performs learning and control using that representation. The motivation for this line of work is to leverage neural networks for RL algorithms without using experience replay. It is well known that neural networks have a "catastrophic interference" issue: a network can forget what it has been trained on in the past, which is exactly the motivation for experience replay in DRL (using a buffer to store experience and replaying that experience later). Ghiassian et al. (2018) proposed to apply input transformations to reduce input interference caused by ReLU gates. Liu et al. (2018) proposed to learn sparse feature representations, which can be used by online algorithms such as Sarsa.

While we also deal with representation learning with neural networks for RL, our motivation is to learn an effective representation in a one-phase process, simultaneously acquiring a good representation and solving control. We also study the generic one-task setting, although our method may be applicable to multi-task control. Our insight in this paper is that the feature-correlation phenomenon occurs frequently in DRL algorithms: features learned by deep neural networks are highly correlated, as measured by their covariances. We empirically show that this correlation causes DRL algorithms to have slow learning curves.


We propose to use a regularized correlation loss function to automatically decorrelate state features for Deep Q-Networks (DQN). Across 49 Atari games, our decorrelation method significantly improves DQN at only a slight computational cost. Our decorrelation algorithms achieve a median human-normalized score of 70%, which is 40% better than DQN. In particular, ours performs better than DQN on 39 games, with 4 close ties and only slight losses on 6 games.

To test the generality of our method, we also apply decorrelation to Quantile Regression DQN (QR-DQN), a recent distributional RL algorithm that achieves generally better performance than DQN. The results show that the same conclusion holds: our decorrelation method performs much better than QR-DQN in terms of the median human-normalized score. In particular, the decorrelation algorithm loses on 5 games, with 10 games being close (less than 5% in cumulative rewards) and 34 winning games (more than 5% in cumulative rewards). We used the same regularization factor for all games in this benchmark. We conducted further experiments on the losing games and found that the losses are due to the choice of regularization factor: for the losing games, decorrelation algorithms with other values of the regularization factor can significantly win over DQN and QR-DQN.

2. Background

We consider a Markov Decision Process (MDP) with a state space S, an action space A, a reward "function" R : S × A → R, a transition kernel p : S × A × S → [0, 1], and a discount ratio γ ∈ [0, 1). In this paper we treat the reward "function" R as a random variable to emphasize its stochasticity. The bandit setting is a special case of the general RL setting, where we usually have only one state.

We use π : S × A → [0, 1] to denote a stochastic policy. We use Zπ(s, a) to denote the random variable of the sum of the discounted future rewards, following the policy π and starting from the state s and the action a. We have
$$Z^\pi(s, a) \doteq \sum_{t=0}^{\infty} \gamma^t R(S_t, A_t),$$
where S0 = s, A0 = a, St+1 ∼ p(·|St, At), and At ∼ π(·|St). The expectation of the random variable Zπ(s, a) is
$$Q^\pi(s, a) \doteq \mathbb{E}_{\pi, p, R}\left[Z^\pi(s, a)\right],$$
which is usually called the state-action value function. In the general RL setting, we are usually interested in finding an optimal policy π∗ such that Qπ∗(s, a) ≥ Qπ(s, a) holds for any (π, s, a). All possible optimal policies share the same optimal state-action value function Q∗, which is the unique fixed point of the Bellman optimality operator (Bellman, 2013),
$$Q(s, a) = \mathcal{T} Q(s, a) \doteq \mathbb{E}[R(s, a)] + \gamma\, \mathbb{E}_{s' \sim p}\big[\max_{a'} Q(s', a')\big].$$

Based on the Bellman optimality operator, Watkins & Dayan (1992) proposed Q-learning to learn the optimal state-action value function Q∗ for control. At each time step, we update Q(s, a) as
$$Q(s, a) \leftarrow Q(s, a) + \alpha \big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big),$$
where α is a step size and (s, a, r, s′) is a transition. There have been many works extending Q-learning to linear function approximation (Sutton & Barto, 2018; Szepesvari, 2010).
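As a purely illustrative aside (not part of the paper), a minimal tabular sketch of this update could look as follows; the table shape, step size, and transition values are assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on a transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])      # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])       # move Q(s, a) toward the target
    return Q

# Example: a table for 10 states and 4 actions, updated on one transition.
Q = np.zeros((10, 4))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```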

3. Regularized Correlation Loss

3.1. DQN with Decorrelation

Mnih et al. (2015) combined Q-learning with deep neural network function approximators, resulting in the Deep Q-Network (DQN). Assuming the Q function is parameterized by a network with weights θ, at each time step DQN performs a stochastic gradient descent step on θ to minimize the loss
$$\frac{1}{2}\big(r_{t+1} + \gamma \max_{a} Q_{\theta^-}(s_{t+1}, a) - Q_\theta(s_t, a_t)\big)^2,$$
where θ− is the target network (Mnih et al., 2015), a copy of θ that is synchronized with θ periodically, and (st, at, rt+1, st+1) is a transition sampled from an experience replay buffer (Mnih et al., 2015), a first-in-first-out queue storing previously experienced transitions.

Suppose the latent feature vector in the last hidden layer of DQN is denoted by a column vector φ ∈ Rd (where d is the number of units in the last hidden layer). The covariance matrix of the features estimated from samples is
$$\mathrm{Var}(\phi) = \frac{1}{n} \sum_{t=1}^{n} \big[\phi(s_t, a_t) - \bar{\phi}\big]\big[\phi(s_t, a_t) - \bar{\phi}\big]^T, \qquad (1)$$
where $\bar{\phi} = \frac{1}{n} \sum_{t=1}^{n} \phi(s_t, a_t)$ and Var(φ) ∈ Rd×d. We know that if two features φi and φj (the ith and jth units in the last layer) are decorrelated, then Var(φ)i,j = 0. To apply decorrelation incrementally and efficiently, our key idea is to regularize the DQN loss function with a term that penalizes correlation in features:

$$\frac{1}{2}\big(r_{t+1} + \gamma \max_{a} Q_{\theta^-}(s_{t+1}, a) - Q_\theta(s_t, a_t)\big)^2 + \frac{2\lambda}{d^2 - d} \sum_{i > j} \big(\mathrm{Var}(\phi)_{i,j}\big)^2,$$
where the regularization term is the mean-squared loss of the off-diagonal entries in the covariance matrix, which we call the regularized correlation loss. The other elements of our algorithm, such as experience replay, are exactly the same as in DQN. We call this new algorithm DQN-decor.
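To make the regularizer concrete, here is a PyTorch-style sketch (ours, not the authors' implementation) that computes the decorrelation penalty from a minibatch of last-hidden-layer features; the exact constant in front of the sum is an assumption and can be absorbed into λ.

```python
import torch

def decorrelation_penalty(phi: torch.Tensor) -> torch.Tensor:
    """Mean squared off-diagonal entry of the minibatch feature covariance.

    phi: (batch_size, d) activations of the last hidden layer.
    By symmetry this equals (2 / (d^2 - d)) * sum_{i > j} Var(phi)_{ij}^2.
    """
    n, d = phi.shape
    centered = phi - phi.mean(dim=0, keepdim=True)
    cov = centered.t() @ centered / n               # (d, d) sample covariance, cf. equation (1)
    off_diag = cov - torch.diag(torch.diag(cov))    # zero out the diagonal (variances)
    return (off_diag ** 2).sum() / (d * d - d)

# Sketch of the full DQN-decor objective on one minibatch:
# loss = 0.5 * (td_error ** 2).mean() + lam * decorrelation_penalty(phi)
```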


3.2. QR-DQN with Decorrelation

The core idea behind QR-DQN is quantile regression, introduced by the seminal paper of Koenker & Bassett Jr (1978). This approach has gained significant attention in theoretical and applied statistics. Let us first consider QR in the supervised learning setting. Given data {(xi, yi)}i, we want to compute the quantile of y corresponding to the quantile level τ. The linear quantile regression loss is defined as
$$L(\beta) = \sum_i \rho_\tau(y_i - x_i \beta), \qquad (2)$$
where
$$\rho_\tau(u) = u\big(\tau - \mathbb{I}_{u < 0}\big) = \tau |u|\, \mathbb{I}_{u \geq 0} + (1 - \tau)|u|\, \mathbb{I}_{u < 0} \qquad (3)$$
is a weighted sum of residuals. The weights depend on the sign of the residual and on the quantile level τ: for higher quantiles, positive residuals receive higher weight, and vice versa. If τ = 1/2, then the estimate of the median of yi given xi is θ1/2(yi|xi) = xiβ, with β = arg min L(β).
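For illustration only, a small NumPy sketch of equations (2) and (3); the function names are ours.

```python
import numpy as np

def pinball_loss(u, tau):
    """Check (pinball) loss rho_tau(u) = u * (tau - 1{u < 0}), as in equation (3)."""
    return u * (tau - (u < 0).astype(float))

def quantile_regression_loss(beta, X, y, tau):
    """L(beta) = sum_i rho_tau(y_i - x_i beta), as in equation (2)."""
    residuals = y - X @ beta
    return pinball_loss(residuals, tau).sum()

# Minimizing this loss over beta (with any generic optimizer) estimates the
# tau-th conditional quantile of y given x.
```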

Instead of learning the expected return Q, distributional RL focuses on learning the full distribution of the random variable Z directly (Jaquette, 1973; Bellemare et al., 2017). There are various approaches to representing a distribution in the RL setting (Bellemare et al., 2017; Dabney et al., 2018; Barth-Maron et al., 2018). In this paper, we focus on the quantile representation (Dabney et al., 2017) used in QR-DQN, where the distribution of Z is represented by a uniform mix of N supporting quantiles:
$$Z_\theta(s, a) \doteq \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(s, a)},$$
where δx denotes a Dirac at x ∈ R, and each θi is an estimate of the quantile corresponding to the quantile level (a.k.a. quantile index) τi ≐ (i − 0.5)/N for 0 < i ≤ N. The state-action value Q(s, a) is then approximated by $\frac{1}{N} \sum_{i=1}^{N} \theta_i(s, a)$. Such an approximation of a distribution is referred to as quantile approximation.
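A minimal sketch (assumed names, not the authors' code) of how the state-action value and the greedy action follow from the quantile estimates:

```python
import numpy as np

def q_from_quantiles(theta):
    """Q(s, a) approximated by the uniform average of the N quantile estimates.

    theta: (num_actions, N) array holding theta_i(s, a) for one state s.
    """
    return theta.mean(axis=1)

def greedy_action(theta):
    """Action maximizing (1/N) * sum_i theta_i(s, a), used for action selection."""
    return int(np.argmax(q_from_quantiles(theta)))
```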

Similar to the Bellman optimality operator in mean-centered RL, we have the distributional Bellman optimality operator for control in distributional RL,
$$\mathcal{T} Z(s, a) \doteq R(s, a) + \gamma Z\big(s', \arg\max_{a'} \mathbb{E}_{p, R}[Z(s', a')]\big), \qquad s' \sim p(\cdot|s, a).$$

Based on the distributional Bellman optimality operator, Dabney et al. (2017) proposed to train the quantile estimates (i.e., {θi}) via the Huber quantile regression loss (Huber et al., 1964). To be more specific, at time step t the loss is
$$\frac{1}{N} \sum_{i=1}^{N} \sum_{i'=1}^{N} \Big[\rho^\kappa_{\tau_i}\big(y_{t,i'} - \theta_i(s_t, a_t)\big)\Big],$$
where
$$y_{t,i'} \doteq r_t + \gamma\, \theta_{i'}\Big(s_{t+1}, \arg\max_{a'} \frac{1}{N} \sum_{i=1}^{N} \theta_i(s_{t+1}, a')\Big) \qquad (4)$$
and
$$\rho^\kappa_{\tau_i}(x) \doteq |\tau_i - \mathbb{I}\{x < 0\}|\, L_\kappa(x), \qquad (5)$$
where I is the indicator function and Lκ is the Huber loss,
$$L_\kappa(x) \doteq \begin{cases} \frac{1}{2} x^2 & \text{if } |x| \leq \kappa, \\ \kappa\big(|x| - \frac{1}{2}\kappa\big) & \text{otherwise.} \end{cases}$$
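For reference, a hedged NumPy sketch of the Huber loss and of equation (5); the function names are ours.

```python
import numpy as np

def huber(x, kappa=1.0):
    """Huber loss L_kappa: quadratic for |x| <= kappa, linear in the tails."""
    abs_x = np.abs(x)
    return np.where(abs_x <= kappa, 0.5 * x ** 2, kappa * (abs_x - 0.5 * kappa))

def huber_quantile_loss(x, tau, kappa=1.0):
    """rho^kappa_tau(x) = |tau - 1{x < 0}| * L_kappa(x), as in equation (5)."""
    return np.abs(tau - (x < 0).astype(float)) * huber(x, kappa)
```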

To decorrelate QR-DQN, we use the following loss function:
$$\frac{1}{N} \sum_{i=1}^{N} \sum_{i'=1}^{N} \Big[\rho^\kappa_{\tau_i}\big(y_{t,i'} - \theta_i(s_t, a_t)\big)\Big] + \frac{2\lambda}{d^2 - d} \sum_{i > j} \big(\mathrm{Var}(\phi)_{i,j}\big)^2,$$
where φ denotes the features encoded by the last hidden layer of QR-DQN, yt,i′ is computed as in equation (4), and ρ is computed as in equation (5).

The other elements of QR-DQN are also preserved. The only difference between our new algorithm and QR-DQN is the regularized correlation loss. We call this new algorithm QR-DQN-decor.

4. Experiments

In this section, we conduct experiments to study the effectiveness of the proposed decorrelation method for DQN and QR-DQN. We used the Atari game environments of Bellemare et al. (2013). In particular, we compare DQN-decor vs. DQN, and QR-DQN-decor vs. QR-DQN. Algorithms were evaluated on training runs of 20 million frames (or equivalently, 5 million agent steps), with 3 runs per game.

In performing the comparisons, we used the same parameters for all the algorithms, which are reported below. The discount factor is 0.99. The image size is (84, 84). We clipped the continuous reward into three discrete values, {0, 1, −1}, according to its sign; this clipping of reward signals leads to more stable performance according to Mnih et al. (2015). The reported performance scores are, however, calculated in terms of the original unclipped rewards. The target network update frequency is 10,000 frames. The learning rate is 10−4. The exploration strategy is epsilon-greedy, with ε decaying linearly from 1.0 to 0.02 over the first 1 million frames and remaining constant at 0.02 after that. The experience replay buffer size is 1 million. The minibatch size for experience replay is 32. In the first 200,000 frames, all four agents behaved randomly to fill the buffer with experience as a warm start.
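As a sketch only (helper names are ours), the reward clipping and exploration schedule described above amount to:

```python
import numpy as np

def clip_reward(r: float) -> float:
    """Clip a continuous reward to {-1, 0, 1} according to its sign."""
    return float(np.sign(r))

def epsilon_at(frame: int, start: float = 1.0, end: float = 0.02,
               decay_frames: int = 1_000_000) -> float:
    """Epsilon decays linearly from 1.0 to 0.02 over the first million frames, then stays at 0.02."""
    frac = min(frame / decay_frames, 1.0)
    return start + frac * (end - start)
```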


Figure 1. Human-normalized performance (using the median across 49 Atari games) over 20 million training frames: DQN-decor vs. DQN.

The input to the neural networks is (84, 84, 4), taking the most recent 4 frames. There are three convolution layers followed by a fully connected layer. The first layer convolves 32 filters of 8 × 8 with stride 4 over the input image and applies a rectifier non-linearity. The second layer convolves 64 filters of 4 × 4 with stride 2, followed by a rectifier non-linearity. The third convolution layer convolves 64 filters of 3 × 3 with stride 1, followed by a rectifier. The final hidden layer has 512 rectifier units. The outputs of these units given an image are just the features encoded in φ; thus the number of features, d, is 512. This is followed by the output layer, which is fully connected and linear, with a single output for each action (the number of actions differs per game, ranging from 4 to 18).
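A PyTorch-style sketch of this architecture follows; the class name, the flattened size (64 × 7 × 7 for an 84 × 84 input), and the convention of returning the features φ alongside the Q-values are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class DQNTrunk(nn.Module):
    """Sketch of the described convolutional network with a 512-unit feature layer."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 7 * 7, 512), nn.ReLU())
        self.head = nn.Linear(512, num_actions)                     # linear Q-value output

    def forward(self, x):                    # x: (batch, 4, 84, 84) stacked frames
        phi = self.fc(self.conv(x))          # phi: (batch, 512) features entering Var(phi)
        return self.head(phi), phi
```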

The κ parameter in the Huber loss function for the QR-DQN algorithms was set to κ = 1, following Dabney et al. (2017).

The correlation loss based on equation (1) is computed on the minibatch sampled during experience replay (the empirical mean is also computed over the minibatch).

4.1. Decorrelating DQN

Performance. First we compare DQN-decor with DQN in Figure 1. The performance is measured by the median performance of DQN-decor and DQN across all games, where for each game the average score over runs is taken first. DQN-decor is shown to outperform DQN starting from about 4 million frames, with the winning edge increasing with time. By 20 million frames, DQN-decor achieves a human-normalized score of 70% (while DQN's is 50%), outperforming DQN by 40%.

While the median performance is widely used, it is only a performance summary across all games.

Figure 2. Correlation loss over training (using the median across 49 Atari games, log scale): DQN-decor vs. DQN.

To see how the algorithms perform in each game, we benchmark them on each game in Figure 3. DQN-decor wins over DQN on 39 games (by more than 5% each), with 4 close ties, and loses slightly on 6 games. In particular, only on Bowling and BattleZone does it lose to DQN by about 10%. For the other 4 losing games, the loss was below 5%.

We dug into the training on Bowling and found that DQN fails to learn this task. The evaluation score for Bowling in the original DQN paper (Mnih et al., 2015) is 42.4 (see their Table 2) with a standard deviation of 88.0 (trained over 50 million frames). In fact, for a fair comparison, one can check the training performance of DQN at https://google.github.io/dopamine/baselines/plots.html, because the scores reported in the original DQN paper are testing performance without exploration. Their results show that for Bowling, the scores of DQN during training fluctuate around 30.0, which is close to our implementation of DQN here. Figure 4a shows that two runs (out of three) of DQN are even worse than the random agent. Because the exploration factor is close to 1.0 in the beginning, the first data points of the curves correspond to the performance of the random agent. The representation learned by DQN turns out to be not useful for this task. In fact, the features learned by these two failure runs of DQN are nearly i.i.d., as reflected by the correlation loss being nearly zero (Figure 4b). So decorrelating the unsuccessful representation does not help in this case.

The loss of DQN-decor to DQN on BattleZone is due to the regularization factor, λ, which is uniformly fixed for all games. In producing these two figures, λ was fixed to 0.01 for all 49 games.


[Figure 3: per-game bar chart, sorted from largest loss (Bowling, BattleZone) to largest gain; y-axis: normalized AUC improvement, −20% to >150%.]

Figure 3. Cumulative reward improvement of DQN-decor over DQN. Each bar indicates improvement, computed as the Area Under the Curve (AUC) at the end of training, normalized relative to DQN. Bars above/below the horizontal axis indicate performance gains/losses. For MontezumaRevenge, the improvement is 15133.34%; for PrivateEye, the improvement is 334.87%.

[Figure 4. (a) Bowling, DQN performance: 1 successful run vs. 2 failure runs. (b) Bowling, DQN correlation loss (log scale): 1 successful run vs. 2 failure runs. (c) BattleZone: DQN-decor with λ = 10−8 performs significantly better than DQN and than DQN-decor with λ = 0.01.]

In fact, we did not run a parameter search to optimize λ for all games; the value λ = 0.01 was picked as our best guess. We did notice that DQN-decor can achieve better performance than DQN on BattleZone with a different regularization factor, as shown in Figure 4c. It appears that the features learned by DQN on this game are already largely decorrelated, and thus a smaller regularization factor (in this case, 10−8) gives better performance.

Setting aside the case of Bowling, where DQN learns an unsuccessful representation, and allowing a better choice of the regularization factor, we found that decorrelation is a uniformly effective technique for training DQN.

Effect of Feature Correlation. To study the relationship between performance and correlation loss, we plot the correlation loss over training time (also using the median measure) across 49 games in Figure 2. The correlation loss is computed by averaging over the minibatch samples during experience replay at each time step.

As shown by the figure, DQN-decor effectively reduces the correlation in features. We can see that the feature correlation loss for DQN-decor is almost flat from no more than 2 million frames onward, indicating that decorrelating features can be achieved faster than learning (good news for representation learning). For DQN, by contrast, the learned features become more and more correlated during the first 6 million frames. After that, however, it appears that DQN does achieve a flat correlation loss, although at a much higher level than DQN-decor.


Figure 5. Human-normalized performance (using the median across 49 Atari games) over 20 million training frames: QR-DQN-decor vs. QR-DQN.

This phenomenon of DQN learning is interesting because it shows that, although correlation is not considered in its loss function, DQN does have the ability to achieve feature decorrelation (to some extent) over time. Note that DQN's performance keeps improving after 6 million frames while at the same time its correlation loss measure is almost flat. This may indicate that it is natural to understand learning in two phases: first decorrelating features (while incurring some correlation loss), and then continuing to learn and improve performance with the decorrelated features.

Interestingly, the faster learning of DQN-decor follows after the features are decorrelated. To see this, note that DQN-decor's advantage over DQN becomes significant after about 6 million frames (Figure 1), while DQN-decor's correlation loss has been flat since before 2 million frames (Figure 2).

4.2. Decorrelating QR-DQN

We further conduct experiments to study whether our decorrelation method applies to QR-DQN. The value of λ was set to 25.0, picked simply from a parameter study on the game of Seaquest. Figure 5 shows the median human-normalized performance scores across 49 games. Similar to the case of DQN, our method achieves much better performance by decorrelating QR-DQN, especially after about 7 million frames. Again, we see that the performance edge of QR-DQN-decor over QR-DQN tends to increase with time. We observed a similar relationship between performance and correlation loss, which supports that decorrelation is the factor that improves performance.

We also profile the performance of the algorithms on each game in terms of normalized AUC, shown in Figure 7. In this case, the decorrelation algorithm loses on 5 games, with 10 games being close (less than 5%) and 34 winning games (more than 5%).

Figure 6. Frostbite learning curves for QR-DQN-1, QR-DQN-1-decor (λ = 25), and QR-DQN-1-decor (λ = 0.01): QR-DQN-decor with λ = 0.01 wins by 42.2% in normalized AUC (area under the curve).

The algorithm lost most to QR-DQN on the games Frostbite, BattleZone, and DemonAttack. This is due to the parameter choice rather than an algorithmic issue. For all three losing games, we performed additional experiments with λ = 0.01 (the same value as in the DQN experiments). The learning curves are shown in Figure 6 (Frostbite), Figure 8a (BattleZone), and Figure 8b (DemonAttack). Recall that QR-DQN-decor with λ = 25.0 lost to QR-DQN by about 25%. Figure 6 shows that with λ = 0.01, the performance of QR-DQN-decor is significantly improved, winning over QR-DQN by 42.2% (in the normalized AUC measure). For the other two games, QR-DQN-decor performs at least no worse than QR-DQN (3.5% better on DemonAttack, −0.18% on BattleZone).

Note that the median human-normalized score across games and the normalized AUC measure for each game may not give a full picture of algorithm performance. Algorithms that perform well in these two measures can still exhibit plummeting behaviour, which is characterized by an abrupt degradation in performance: the learning curve can drop to a low score and stay there indefinitely. A more detailed discussion of this point is given in (Machado et al., 2017). To study whether our decorrelation algorithms exhibit plummeting behaviour, we benchmark all algorithms by averaging the rewards over the last one million training frames (which is essentially near-tail performance in training). In this measure, the results are summarized in Table 1. Out of 49 games, decorrelation algorithms perform the best in 39 games: DQN-decor is best in 15 games and QR-DQN-decor is best in 24 games. Thus our decorrelation algorithms do not exhibit plummeting behaviour.


[Figure 7: per-game bar chart, sorted from largest loss (Frostbite, BattleZone, DemonAttack) to largest gain; y-axis: normalized AUC improvement, −20% to >150%.]

Figure 7. Normalized AUC: cumulative reward improvement of QR-DQN-decor over QR-DQN. For MontezumaRevenge, the improvement is 419.31%; for Hero, the improvement is 414.22%; and for PrivateEye, the improvement is 318.18%.

Figure 8. Re-comparing QR-DQN-decor (with λ = 0.01) and QR-DQN in normalized AUC. Left (a, BattleZone): losing to QR-DQN by 0.18%. Right (b, DemonAttack): winning over QR-DQN by 3.5%.

In summary, the empirical results in this section show that decorrelation is effective for improving the performance of both DQN and QR-DQN. Even on the worst losing games (which arose from using the same fixed regularization factor for all games), both DQN and QR-DQN can still be significantly improved by our decorrelation method with a tuned regularization factor.

4.3. Analysis on the Learned Representation

To study the correlation between the learned features after training finishes, we first compute the feature covariance matrix (see equation (1)) using the one million samples in the experience replay buffer at the end of training, for both DQN and DQN-decor.

[Figure 9 heat maps: (a) DQN, colorbar range 0–2.0; (b) DQN-decor, colorbar range 0–0.02.]

Figure 9. Feature correlation heat maps on Seaquest. First, the diagonalization pattern is more obvious in the feature covariance matrix of DQN-decor. Second, the magnitude of feature correlation in DQN is much larger than in DQN-decor. Third, there are more highly activated features in DQN and they correlate with each other, suggesting that many of them may be redundant.

Then we sort the 512 features according to their variances, which are just the diagonal entries of the feature covariance matrix.
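As an illustrative sketch (assumed names, not the authors' analysis code), the covariance matrix restricted to the highest-variance features could be computed as follows.

```python
import numpy as np

def top_feature_covariance(phi, k=50):
    """Feature covariance (equation (1)) restricted to the k highest-variance units.

    phi: (n_samples, d) array of last-hidden-layer features collected from the
    replay buffer at the end of training.
    """
    centered = phi - phi.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / phi.shape[0]
    top = np.argsort(np.diag(cov))[::-1][:k]     # indices of the k largest variances
    return cov[np.ix_(top, top)]                 # k x k block visualized as a heat map
```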

A visualization of the top 50 features on the game of Seaquest is shown in Figure 9a (DQN) and Figure 9b (DQN-decor). Three observations can be made. First, the diagonalization pattern is more obvious in the feature covariance matrix of DQN-decor.


Figure 10. Activation of the most important features (according to their variances) on Seaquest (6 random samples). Top row: features of DQN-decor. Middle row: features of DQN. Bottom row: current image frame.

Second, the magnitude of feature correlation in DQN is much larger than in DQN-decor (note that the heat intensity bars on the right of the two plots differ). Third, there are more highly activated features in DQN, as reflected by the diagonal entries of DQN's matrix having more intense (dark) values. Interestingly, there is also almost the same number of equally sized squares in the off-diagonal region of DQN's matrix, indicating that these highly activated features correlate with each other and many of them may be redundant. In contrast, the off-diagonal region of DQN-decor's matrix has very few intense values, suggesting that these important features are successfully decorrelated.

Figure 10 shows randomly sampled frames and their feature activation values (the same set of 50 features as in Figures 9a and 9b) for both algorithms. Interestingly, far fewer features of DQN-decor (top row) are activated at the same time for a given input image. DQN, in contrast, has many highly activated features at the same time, whereas for DQN-decor there are just a few (in these samples, only one) highly active features. Interpreting features for DRL algorithms may be made easier thanks to decorrelation. Although we do not have any conclusions on this yet, it is an interesting direction for future research.

5. Conclusion

In this paper, we found that feature correlation is a key factor in the performance of DQN algorithms. We have proposed a method for obtaining decorrelated features for DQN. The key idea is to regularize the loss function of DQN with a correlation loss computed from the mean squared correlation between features. Our decorrelation method turns out to be very effective for training DQN algorithms. We show that it also applies to QR-DQN, improving QR-DQN significantly on most games and losing only on a few games. An experimental study of the losing games shows that the losses are due to the choice of regularization factor. Our decorrelation method effectively improves DQN and QR-DQN, with better or no worse performance across most games, which makes it promising for improving representation learning with other DRL algorithms.

A. All Learning Curves

The learning curves of DQN-decor vs. DQN on all games are shown in Figure 11 (λ = 0.01).

The learning curves of QR-DQN-decor vs. QR-DQN are shown in Figure 12 (λ = 25.0).

References

Albus, J. S. Data storage in the cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97(3):228–233, 1975.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.

Bellman, R. Dynamic Programming. Courier Corporation, 2013.

Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044, 2017.

Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018.

Ghiassian, S., Yu, H., Rafiee, B., and Sutton, R. S. Two geometric input transformation methods for fast online reinforcement learning with neural nets. CoRR, abs/1805.07476, 2018. URL http://arxiv.org/abs/1805.07476.

Huber, P. J. et al. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 1964.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016. URL http://arxiv.org/abs/1611.05397.

Jaquette, S. C. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1973.

Koenker, R. and Bassett Jr, G. Regression quantiles. Econometrica: Journal of the Econometric Society, pp. 33–50, 1978.

Konidaris, G., Osentoski, S., and Thomas, P. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI'11, pp. 380–385. AAAI Press, 2011. URL http://dl.acm.org/citation.cfm?id=2900423.2900483.

Liu, V., Kumaraswamy, R., Le, L., and White, M. The utility of sparse representations for control in reinforcement learning. CoRR, abs/1811.06626, 2018. URL http://arxiv.org/abs/1811.06626.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009, 2017.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 752–759. ACM, 2008.

Petrik, M. An analysis of Laplacian methods for value function approximation in MDPs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pp. 2574–2579, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1625275.1625690.

Silver, D. Reinforcement Learning and Simulation-Based Search in Computer Go. PhD thesis, 2009.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction (2nd Edition). MIT Press, 2018.

Szepesvari, C. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 1992.


Game  DQN  DQN-decor (λ = 0.01)  QR-DQN-1  QR-DQN-1-decor (λ = 25)
Alien  1445.8  1348.5  1198.0  1496.0
Amidar  234.9  263.8  236.5  339.0
Assault  2052.6  1950.8  6118.2  5341.3
Asterix  3880.2  5541.7  6978.4  8418.6
Asteroids  733.2  1292.0  1442.4  1731.4
Atlantis  189008.6  307251.6  65875.3  148162.4
BankHeist  568.4  648.1  730.3  740.8
BattleZone  15732.2  14945.3  18001.3  17852.6
BeamRider  5193.1  5394.4  5723.6  6483.1
Bowling  27.3  21.9  22.5  22.2
Boxing  85.6  86.0  87.1  83.9
Breakout  311.3  337.7  372.8  393.3
Centipede  2161.2  2360.5  6003.0  6092.9
ChopperCommand  1362.4  1735.2  2266.9  2777.4
CrazyClimber  69023.8  100318.4  74110.1  100278.4
DemonAttack  7679.6  7471.8  34845.7  27393.6
DoubleDunk  -15.5  -16.8  -18.5  -19.1
Enduro  808.3  891.7  409.4  884.5
FishingDerby  0.7  11.7  9.0  10.8
Freeway  23.0  32.4  25.6  24.9
Frostbite  293.8  376.6  1414.2  755.7
Gopher  2064.5  3067.6  2816.5  3451.8
Gravitar  271.2  382.3  305.9  314.9
Hero  3025.4  6197.1  1948.4  9352.2
IceHockey  -10.0  -8.6  -10.3  -9.6
Jamesbond  387.5  471.0  391.4  515.0
Kangaroo  3933.3  3955.5  1987.8  2504.4
Krull  5709.9  6286.4  6547.7  6567.9
KungFuMaster  16999.0  20482.9  22131.3  23531.4
MontezumaRevenge  0.0  0.0  0.0  2.4
MsPacman  2019.0  2166.0  2221.4  2407.9
NameThisGame  7699.0  7578.2  8407.0  8341.0
Pong  19.9  20.0  19.9  20.0
PrivateEye  345.6  610.8  41.7  251.0
Qbert  2823.5  4432.4  4041.0  5148.1
Riverraid  6431.3  7613.8  7134.8  7700.5
RoadRunner  35898.6  39327.0  36800.0  37917.6
Robotank  24.8  24.5  31.3  30.6
Seaquest  4216.6  6635.7  4856.8  5224.6
SpaceInvaders  1015.8  913.0  946.7  825.1
StarGunner  15586.6  21825.0  25530.8  37461.0
Tennis  -22.3  -21.2  -17.3  -16.2
TimePilot  2802.8  3852.1  3655.8  3651.6
Tutankham  103.4  116.2  148.3  190.2
UpNDown  8234.5  9105.8  8647.4  11342.4
Venture  8.4  15.3  0.7  1.1
VideoPinball  11564.1  15759.3  53207.5  66439.4
WizardOfWor  1804.3  2030.3  2109.8  2530.6
Zaxxon  3105.2  7049.4  4179.8  6816.9

Table 1. Comparison of game scores obtained by our decorrelation algorithms with DQN and QR-DQN, averaged over the last one million training frames (the total number of training frames is 20 million). Out of 49 games, decorrelation algorithms perform the best in 39 games: DQN-decor is best in 15 games and QR-DQN-decor is best in 24 games.



Figure 11. DQN-decor (orange) vs. DQN (blue): all learning curves for 49 Atari games.



Figure 12. QR-DQN-decor (orange) vs. QR-DQN (blue): all learning curves for 49 Atari games.