

A Geometric Perspective on Optimal Representations for Reinforcement Learning

Marc G. Bellemare1, Will Dabney2, Robert Dadashi1, Adrien Ali Taiga1,3, Pablo Samuel Castro1, Nicolas Le Roux1, Dale Schuurmans1,4, Tor Lattimore2, Clare Lyle5

Abstract

We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. We leverage this perspective to provide formal evidence regarding the usefulness of value functions as auxiliary tasks. Our formulation considers adapting the representation to minimize the (linear) approximation error of the value functions of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions regarding a special class of value functions which we call adversarial value functions (AVFs). We demonstrate that using value functions as auxiliary tasks corresponds to an expected-error relaxation of our formulation, with AVFs a natural candidate, and identify a close relationship with proto-value functions (Mahadevan, 2005). We highlight characteristics of AVFs and their usefulness as auxiliary tasks in a series of experiments on the four-room domain.

1 Introduction

A good representation of state is key to practical success in reinforcement learning. While early applications used hand-engineered features (Samuel, 1959), these have proven onerous to generate and difficult to scale. As a result, methods in representation learning have flourished, ranging from basis adaptation (Menache et al., 2005; Keller et al., 2006), gradient-based learning (Yu and Bertsekas, 2009), proto-value functions (Mahadevan and Maggioni, 2007), feature generation schemes such as tile coding (Sutton, 1996) and the domain-independent features used in some Atari 2600 game-playing agents (Bellemare et al., 2013; Liang et al., 2016), to nonparametric methods (Ernst et al., 2005; Farahmand et al., 2016; Tosatto et al., 2017). Today, the method of choice is deep learning. Deep learning has made its mark by showing it can learn complex representations of relatively unprocessed inputs using gradient-based optimization (Tesauro, 1995; Mnih et al., 2015; Silver et al., 2016).

Most current deep reinforcement learning methods augment their main objective with additional losses called auxiliary tasks, typically with the aim of facilitating and regularizing the representation learning process. The UNREAL algorithm, for example, makes predictions about future pixel values (Jaderberg et al., 2017); recent work approximates a one-step transition model to achieve a similar effect (François-Lavet et al., 2018; Gelada et al., 2019). The good empirical performance of distributional reinforcement learning (Bellemare et al., 2017) has also been attributed to representation learning effects, with recent visualizations supporting this claim (Such et al., 2019). However, while there is now conclusive empirical evidence of the usefulness of auxiliary tasks, their design and justification remain on the whole ad hoc. One of our main contributions is to provide a formal framework in which to reason about auxiliary tasks in reinforcement learning.

We begin by formulating an optimization problem whose solution is a form of optimal representation. Specifically, we seek a state representation from which we can best approximate the value function of any stationary policy for a given Markov Decision Process.

1Google Brain  2DeepMind  3Mila, Université de Montréal  4University of Alberta  5University of Oxford

Preprint. Under review.

arXiv:1901.11530v2 [cs.LG] 26 Jun 2019


Simultaneously, the largest approximation error in that class serves as a measure of the quality of the representation. While our approach may appear naive – in real settings, most policies are uninteresting and hence may distract the representation learning process – we show that our representation learning problem can in fact be restricted to a special subset of value functions which we call adversarial value functions (AVFs). We then characterize these adversarial value functions and show they correspond to deterministic policies that either minimize or maximize the expected return at each state, based on the solution of a network-flow optimization derived from an interest function δ.

A consequence of our work is to formalize why predicting value function-like objects is helpful in learning representations, as has been argued in the past (Sutton et al., 2011, 2016). We show how using these predictions as auxiliary tasks can be interpreted as a relaxation of our optimization problem. From our analysis, we hypothesize that auxiliary tasks that resemble adversarial value functions should give rise to good representations in practice. We complement our theoretical results with an empirical study in a simple grid world environment, focusing on the use of deep learning techniques to learn representations. We find that predicting adversarial value functions as auxiliary tasks leads to rich representations.

2 Setting

We consider an environment described by a Markov Decision Process ⟨X, A, r, P, γ⟩ (Puterman, 1994); X and A are finite state and action spaces, P : X × A → P(X) is the transition function, γ the discount factor, and r : X → R the reward function. For a finite set S, write P(S) for the probability simplex over S. A (stationary) policy π is a mapping X → P(A), also denoted π(a|x). We denote the set of policies by P = P(A)^X. We combine a policy π with the transition function P to obtain the state-to-state transition function

P^π(x′|x) := Σ_{a∈A} π(a|x) P(x′|x, a).

The value function V^π describes the expected discounted sum of rewards obtained by following π:

V^π(x) = E[ Σ_{t=0}^∞ γ^t r(x_t) | x_0 = x, x_{t+1} ∼ P^π(·|x_t) ].

The value function satisfies Bellman's equation (Bellman, 1957): V^π(x) = r(x) + γ E_{P^π} V^π(x′). We will find it convenient to use vector notation: assuming there are n = |X| states, we view r and V^π as vectors in R^n and P^π ∈ R^{n×n}, yielding

V^π = r + γ P^π V^π = (I − γ P^π)^{-1} r.
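For concreteness, this closed-form policy evaluation is straightforward to compute on a small MDP. Below is a minimal numpy sketch; the random MDP is purely illustrative and will be reused in the later sketches.

```python
import numpy as np

def policy_value(r, P, pi, gamma=0.9):
    """Closed-form policy evaluation, V^pi = (I - gamma * P^pi)^{-1} r.

    r  : (n,) reward vector
    P  : (n, a, n) transition tensor, P[x, a, y] = P(y | x, a)
    pi : (n, a) stochastic policy, rows sum to 1
    """
    n = r.shape[0]
    # State-to-state transition matrix P^pi(x' | x) = sum_a pi(a|x) P(x'|x, a).
    P_pi = np.einsum('xa,xay->xy', pi, P)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)

# Tiny random MDP as a sanity check.
rng = np.random.default_rng(0)
n, a = 5, 2
P = rng.random((n, a, n)); P /= P.sum(axis=-1, keepdims=True)
r = rng.random(n)
pi = np.full((n, a), 1.0 / a)           # uniformly random policy
V = policy_value(r, P, pi)
# Bellman consistency: V = r + gamma * P^pi V.
P_pi = np.einsum('xa,xay->xy', pi, P)
assert np.allclose(V, r + 0.9 * P_pi @ V)
```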

A d-dimensional representation is a mapping φ : X → R^d; φ(x) is the feature vector for state x. We write Φ ∈ R^{n×d} to denote the matrix whose rows are φ(x), x ∈ X, and with some abuse of notation denote the set of d-dimensional representations by R ≡ R^{n×d}. For a given representation and weight vector θ ∈ R^d, the linear approximation for a value function is

V_{φ,θ}(x) := φ(x)^⊤ θ.

We consider the approximation minimizing the uniformly weighted squared error

‖V_{φ,θ} − V^π‖_2^2 = Σ_{x∈X} (φ(x)^⊤ θ − V^π(x))^2.

We denote by V^π_φ the projection of V^π onto the linear subspace H = {Φθ : θ ∈ R^d}.
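The projection V^π_φ is then an ordinary least-squares solve. Continuing the numpy sketch above, with an arbitrary illustrative feature matrix Phi:

```python
def project_value(Phi, V):
    """Orthogonal projection of V onto the column span of Phi (uniform state weighting)."""
    theta, *_ = np.linalg.lstsq(Phi, V, rcond=None)   # argmin_theta ||Phi theta - V||_2^2
    return Phi @ theta, theta

d = 3
Phi = rng.random((n, d))                      # an arbitrary d-dimensional representation
V_phi, theta = project_value(Phi, V)          # V^pi_phi for the V^pi computed above
sq_error = float(np.sum((V_phi - V) ** 2))    # ||V^pi_phi - V^pi||_2^2
```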

2.1 Two-Part Approximation

We view V^π_φ as a two-part approximation arising from the composition of an adjustable representation φ and a weight vector θ; we use the term “two-part” to emphasize that the mapping φ(x) ↦ V^π_φ(x) is linear, while φ itself may not be. This separation into two parts gives us a simple framework in which to study the behaviour of representation learning, in particular deep networks applied to reinforcement learning. We will further consider the use of φ(x) to make additional predictions, called auxiliary tasks following common usage, and whose purpose is to improve or stabilize the representation.
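To make the two-part structure concrete, the following sketch (our own illustration, not the paper's architecture) builds φ from a fixed nonlinearity over one-hot state encodings and fits only the linear weights θ, continuing the running numpy example:

```python
W = rng.standard_normal((n, 4))                      # parameters of the nonlinear part (fixed here)

def phi(x):
    """Representation: one-hot state encoding -> R^4 feature vector."""
    return np.tanh(np.eye(n)[x] @ W)

Phi_nl = np.stack([phi(x) for x in range(n)])        # feature matrix, one row per state
theta, *_ = np.linalg.lstsq(Phi_nl, V, rcond=None)   # linear second part, fit to V^pi from above
V_two_part = Phi_nl @ theta                          # the two-part approximation V_{phi,theta}
```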

We study two-part approximations in an idealized setting where the length d of φ(x) is fixed and smaller than n, but the mapping is otherwise unconstrained.


[Figure 1 graphic: a state x is mapped to a representation φ(x), from which weights θ produce the state value V_φ(x) together with auxiliary-task predictions; the right panel depicts the value polytope V ⊂ R^n and the projected value function V^π_φ.]
Figure 1: Left. A deep reinforcement learning architecture viewed as a two-part approximation. Right. The optimal representation φ∗ is a linear subspace that cuts through the value polytope.

Even this idealized design offers interesting problems to study. We might be interested in sharing a representation across problems, as is often done in transfer or continual learning. In this context, auxiliary tasks may inform how the value function should generalize to these new problems. In many problems of interest, the weights θ can also be optimized more efficiently than the representation itself, warranting the view that the representation should be adapted using a different process (Levine et al., 2017; Chung et al., 2019).

Note that a trivial “value-as-feature” representation exists for the single-policy optimization problem

min_{φ∈R} ‖V^π_φ − V^π‖_2^2;

this approximation sets φ(x) = V^π(x), θ = 1. In this paper we take the stance that this is not a satisfying representation, and that a good representation should be in the service of a broader goal (e.g. control, transfer, or fairness).
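It is easy to verify this numerically in the running sketch: using V^π itself as a single feature reproduces V^π exactly, but the same feature generally fits another policy's value function poorly.

```python
Phi_triv = V.reshape(-1, 1)                       # phi(x) = V^pi(x), a d = 1 representation
V_fit, _ = project_value(Phi_triv, V)
assert np.allclose(V_fit, V)                      # zero error for this one policy (theta = 1)

# ... but the same feature need not fit another policy's value function.
pi2 = rng.dirichlet(np.ones(a), size=n)           # some other stochastic policy
V2 = policy_value(r, P, pi2)
V2_fit, _ = project_value(Phi_triv, V2)
error_other = float(np.sum((V2_fit - V2) ** 2))   # typically > 0
```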

3 Representation Learning by Approximating Value Functions

We measure the quality of a representation φ in terms of how well it can approximate all possible value functions, formalized as the representation error

L(φ) := max_{π∈P} L(φ; π),    L(φ; π) := ‖V^π_φ − V^π‖_2^2.

We consider the problem of finding the representation φ ∈ R minimizing L(φ):

min_{φ∈R} max_{π∈P} ‖V^π_φ − V^π‖_2^2.    (1)

In the context of our work, we call this the representation learning problem (RLP) and say that a representation φ∗ is optimal when it minimizes the error in (1). Note that L(φ) (and hence φ∗) depends on characteristics of the environment, in particular on both the reward and transition functions.

We consider the RLP from a geometric perspective (Figure 1, right). Dadashi et al. (2019) showed that the set of value functions achieved by the set of policies P, denoted

V := {V ∈ R^n : V = V^π for some π ∈ P},

forms a (possibly nonconvex) polytope. As previously noted, a given representation φ defines a linear subspace H of possible value approximations. The maximal error is achieved by the value function in V which is furthest along the subspace normal to H, since V^π_φ is the orthogonal projection of V^π.

We say that V ∈ V is an extremal vertex if it is a vertex of the convex hull of V. Our first result shows that for any direction δ ∈ R^n, the furthest point in V along δ is an extremal vertex, and is in general unique for this δ (proof in the appendix).

Lemma 1. Let δ ∈ R^n and define the functional f_δ(V) := δ^⊤V, with domain V. Then f_δ is maximized by an extremal vertex U ∈ V, and there is a deterministic policy π for which V^π = U. Furthermore, the set of directions δ ∈ R^n for which the maximum of f_δ is achieved by multiple extremal vertices has Lebesgue measure zero in R^n.

Denote by P_v the set of policies corresponding to extremal vertices of V. We next derive an equivalence between the RLP and an optimization problem which only considers policies in P_v.


Theorem 1. For any representation φ ∈ R, the maximal approximation error measured over all value functions is the same as the error measured over the set of extremal vertices:

max_{π∈P} ‖V^π_φ − V^π‖_2^2 = max_{π∈P_v} ‖V^π_φ − V^π‖_2^2.

Theorem 1 indicates that we can find an optimal representation by considering a finite (albeit exponential) number of value functions, since each extremal vertex corresponds to the value function of some deterministic policy, of which there are at most an exponential number. We will call these adversarial value functions (AVFs), because of the minimax flavour of the RLP.
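In a small MDP, Theorem 1 makes L(φ) directly computable: enumerate the deterministic policies, project each of their value functions onto the span of Φ, and keep the worst squared error. A brute-force numpy sketch, feasible only for tiny state and action spaces (P, r and Phi are those of the earlier sketches):

```python
import itertools
import numpy as np

def representation_error(Phi, P, r, gamma=0.9):
    """L(Phi): max over deterministic policies of ||V^pi_phi - V^pi||_2^2 (via Theorem 1)."""
    n, num_a, _ = P.shape
    worst = 0.0
    for actions in itertools.product(range(num_a), repeat=n):   # all |A|^n deterministic policies
        P_pi = P[np.arange(n), list(actions)]                    # row x is P(. | x, pi(x))
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
        theta, *_ = np.linalg.lstsq(Phi, V, rcond=None)
        worst = max(worst, float(np.sum((Phi @ theta - V) ** 2)))
    return worst

L_phi = representation_error(Phi, P, r)   # representation error of the running example's Phi
```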

Solving the RLP allows us to provide quantifiable guarantees on the performance of certain value-based learning algorithms. For example, in the context of least-squares policy iteration (LSPI; Lagoudakis and Parr, 2003), minimizing the representation error L directly improves the performance bound. By contrast, we cannot have the same guarantee if φ is learned by minimizing the approximation error for a single value function.

Corollary 1. Let φ∗ be an optimal representation in the RLP. Consider the sequence of policies π_0, π_1, . . . derived from LSPI using φ∗ to approximate V^{π_0}, V^{π_1}, . . . under a uniform sampling of the state space. Then there exists an MDP-dependent constant C ∈ R such that

lim sup_{k→∞} ‖V^* − V^{π_k}‖_2^2 ≤ C L(φ∗).

This result is a direct application of the quadratic norm bounds given by Munos (2003), in whose work the constant is made explicit. We emphasize that the result is illustrative; our approach should enable similar guarantees in other contexts (e.g. Munos, 2007; Petrik and Zilberstein, 2011).

3.1 The Structure of Adversarial Value Functions

The RLP suggests that an agent trained to predict various value functions should develop a good state representation. Intuitively, one may worry that there are simply too many “uninteresting” policies, and that a representation learned from their value functions emphasizes the wrong quantities. However, the search for an optimal representation φ∗ is closely tied to the much smaller set of adversarial value functions (AVFs). The aim of this section is to characterize the structure of AVFs and show that they form an interesting subset of all value functions. From this, we argue that their use as auxiliary tasks should also produce structured representations.

From Lemma 1, recall that an AVF is geometrically defined using a vector δ ∈ R^n and the functional f_δ(V) := δ^⊤V, which the AVF maximizes. Since f_δ is restricted to the value polytope, we can consider the equivalent policy-space functional g_δ : π ↦ δ^⊤V^π. Observe that

max_{π∈P} g_δ(π) = max_{π∈P} δ^⊤V^π = max_{π∈P} Σ_{x∈X} δ(x) V^π(x).    (2)

In this optimization problem, the vector δ defines a weighting over the state space X; for this reason, we call δ an interest function in the context of AVFs. Whenever δ ≥ 0 componentwise, we recover the optimal value function, irrespective of the exact magnitude of δ (Bertsekas, 2012). If δ(x) < 0 for some x, however, the maximization becomes a minimization. As the next result shows, the policy maximizing g_δ(π) depends on a network flow d_π derived from δ and the transition function P.

Theorem 2. Maximizing the functional g_δ is equivalent to finding a network flow d_π that satisfies a reverse Bellman equation:

max_{π∈P} δ^⊤V^π = max_{π∈P} d_π^⊤ r,    d_π = δ + γ (P^π)^⊤ d_π.

For a policy π maximizing the above we have

V^π(x) = r(x) + γ max_{a∈A} E_{x′∼P(·|x,a)} V^π(x′)   if d_π(x) > 0,
V^π(x) = r(x) + γ min_{a∈A} E_{x′∼P(·|x,a)} V^π(x′)   if d_π(x) < 0.

Corollary 2. There are at most 2^n distinct adversarial value functions.

The vector d_π corresponds to the sum of discounted interest weights flowing through a state x, similar to the dual variables in the theory of linear programming for MDPs (Puterman, 1994). Theorem 2, by way of the corollary, implies that there are fewer AVFs (≤ 2^n) than deterministic policies (= |A|^n). It also implies that AVFs relate to a reward-driven purpose, similar to how the optimal value function describes the goal of maximizing return. We will illustrate this point empirically in Section 4.1.
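The identity underlying Theorem 2 is easy to check numerically for any fixed policy: solving d_π = (I − γ(P^π)^⊤)^{-1} δ gives δ^⊤V^π = d_π^⊤ r. Continuing the running numpy sketch:

```python
delta = rng.choice([-1.0, 0.0, 1.0], size=n)             # an interest function over states
pi_rand = rng.dirichlet(np.ones(a), size=n)              # any fixed stochastic policy
P_pi = np.einsum('xa,xay->xy', pi_rand, P)

V_pi = np.linalg.solve(np.eye(n) - 0.9 * P_pi, r)        # V^pi
d_pi = np.linalg.solve(np.eye(n) - 0.9 * P_pi.T, delta)  # d_pi = delta + gamma * P_pi^T d_pi

assert np.allclose(delta @ V_pi, d_pi @ r)               # f_delta(V^pi) computed two ways
```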


3.2 Relationship to Auxiliary Tasks

So far we have argued that solving the RLP leads to a representation which is optimal in a meaningful sense. However, solving the RLP seems computationally intractable: there are an exponential number of deterministic policies to consider (Prop. 1 in the appendix gives a quadratic formulation with quadratic constraints). Using interest functions does not mitigate this difficulty: the computational problem of finding the AVF for a single interest function is NP-hard, even when restricted to deterministic MDPs (Prop. 2 in the appendix).

Instead, in this section we consider a relaxation of the RLP and show that this relaxation describes existing representation learning methods, in particular those that use auxiliary tasks. Let ξ be some distribution over R^n. We begin by replacing the maximum in (1) by an expectation:

min_{φ∈R} E_{V∼ξ} ‖V_φ − V‖_2^2.    (3)

The use of the expectation offers three practical advantages over the use of the maximum. First, this leads to a differentiable objective which can be minimized using deep learning techniques. Second, the choice of ξ gives us an additional degree of freedom; in particular, ξ need not be restricted to the value polytope. Third, the minimizer in (3) is easily characterized, as the following theorem shows.

Theorem 3. Let u_1^*, . . . , u_d^* ∈ R^n be the principal components of the distribution ξ, in the sense that

u_i^* := arg max_{u∈B_i} E_{V∼ξ} (u^⊤V)^2,    where B_i := {u ∈ R^n : ‖u‖_2^2 = 1, u^⊤u_j^* = 0 ∀ j < i}.

Equivalently, u_1^*, . . . , u_d^* are the eigenvectors of E_ξ[VV^⊤] ∈ R^{n×n} with the d largest eigenvalues. Then the matrix [u_1^*, . . . , u_d^*] ∈ R^{n×d}, viewed as a map X → R^d, is a solution to (3). When the principal components are uniquely defined, any minimizer of (3) spans the same subspace as u_1^*, . . . , u_d^*.
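In the finite-sample setting, Theorem 3 says a minimizer of (3) can be read off from the top principal components of the sampled value functions. The sketch below continues the running example, taking ξ (purely for illustration) to be the distribution of value functions of random stochastic policies:

```python
k, d = 200, 3
samples = []
for _ in range(k):
    pi_sample = rng.dirichlet(np.ones(a), size=n)          # V ~ xi: value of a random policy
    samples.append(policy_value(r, P, pi_sample))
samples = np.stack(samples)                                # (k, n)

second_moment = samples.T @ samples / k                    # estimate of E_xi[V V^T]
eigvals, eigvecs = np.linalg.eigh(second_moment)
Phi_pca = eigvecs[:, np.argsort(eigvals)[::-1][:d]]        # (n, d): rows are phi(x), per Theorem 3
```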

One may expect the quality of the learned representation to depend on how closely the distribution ξ relates to the RLP. From an auxiliary tasks perspective, this corresponds to choosing tasks that are in some sense useful. For example, generating value functions from the uniform distribution over the set of policies P, while a natural choice, may put too much weight on “uninteresting” value functions.

In practice, we may further restrict ξ to a finite set V. Under a uniform weighting, this leads to a representation loss

L(φ; V) := Σ_{V∈V} ‖V_φ − V‖_2^2    (4)

which corresponds to the typical formulation of an auxiliary-task loss (e.g. Jaderberg et al., 2017). In a deep reinforcement learning setting, one typically minimizes (4) using stochastic gradient descent methods, which scale better than batch methods such as singular value decomposition (but see Wu et al. (2019) for further discussion).
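As a deliberately small-scale illustration of minimizing (4) by gradient descent, the sketch below trains a one-hidden-layer φ over one-hot state encodings, with one linear prediction head per target value function; the targets (reusing the sampled value functions from the previous sketch), architecture, and hyperparameters are our own arbitrary choices rather than the paper's.

```python
targets = samples[:8].T                    # (n, 8): a finite set V of value functions to predict
n_states, k_aux = targets.shape
d, lr, steps = 3, 0.01, 5000

X = np.eye(n_states)                       # one-hot state encodings
W1 = 0.1 * rng.standard_normal((n_states, d)); b1 = np.zeros(d)   # parameters of phi
W2 = 0.1 * rng.standard_normal((d, k_aux))                        # one linear head per task

for _ in range(steps):
    H = np.tanh(X @ W1 + b1)               # phi(x) for every state, shape (n, d)
    err = (H @ W2 - targets) / n_states    # gradient of loss (4), rescaled by 1/n for stability
    dW2 = H.T @ err
    dZ = (err @ W2.T) * (1.0 - H ** 2)     # backpropagate through tanh
    W1 -= lr * (X.T @ dZ); b1 -= lr * dZ.sum(axis=0); W2 -= lr * dW2

phi_learned = np.tanh(X @ W1 + b1)         # the learned d-dimensional representation
```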

Our analysis leads us to conclude that, in many cases of interest, the use of auxiliary tasks produces representations that are close to the principal components of the set of tasks under consideration. If V is well-aligned with the RLP, minimizing L(φ; V) should give rise to a reasonable representation. To demonstrate the power of this approach, in Section 4 we will study the case when the set V is constructed by sampling AVFs – emphasizing the policies that support the solution to the RLP.

3.3 Relationship to Proto-Value Functions

Proto-value functions (Mahadevan and Maggioni, 2007, PVF) are a family of representations which vary smoothly across the state space. Although the original formulation defines this representation as the largest-eigenvalue eigenvectors of the Laplacian of the transition function's graphical structure, recent formulations use the top singular vectors of (I − γP^π)^{-1}, where π is the uniformly random policy (Stachenfeld et al., 2014; Machado et al., 2017; Behzadian and Petrik, 2018).

In line with the analysis of the previous section, proto-value functions can also be interpreted as defining a set of value-based auxiliary tasks. Specifically, if we define an indicator reward function r_y(x) := I[x = y] and a set of value functions V = {(I − γP^π)^{-1} r_y}_{y∈X} with π the uniformly random policy, then any d-dimensional representation that minimizes (4) spans the same basis as the d-dimensional PVF (up to the bias term). This suggests a connection with hindsight experience replay (Andrychowicz et al., 2017), whose auxiliary tasks consist in reaching previously experienced states.
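Under this reading, the PVF-style basis has a simple construction: the columns of the successor matrix (I − γP^π)^{-1} are exactly the value functions of the indicator rewards, and their top-d principal directions are its top left singular vectors. Continuing the running sketch:

```python
pi_unif = np.full((n, a), 1.0 / a)                 # uniformly random policy
P_unif = np.einsum('xa,xay->xy', pi_unif, P)
SR = np.linalg.inv(np.eye(n) - 0.9 * P_unif)       # column y is (I - gamma P^pi)^{-1} r_y

U, S, _ = np.linalg.svd(SR)
Phi_pvf = U[:, :3]                                 # top-3 singular vectors: a PVF-style representation
```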


4 Empirical Studies

In this section we complement our theoretical analysis with an experimental study. In turn, we take a closer look at 1) the structure of adversarial value functions, 2) the shape of representations learned using AVFs, and 3) the performance profile of these representations in a control setting.

Our eventual goal is to demonstrate that the representation learning problem (1), which is based on approximating value functions, gives rise to representations that are both interesting and comparable to previously proposed schemes. Our concrete instantiation (Algorithm 1) uses the representation loss (4). As-is, this algorithm is of limited practical relevance (our AVFs are learned using a tabular representation) but we believe it provides an inspirational basis for further developments.

Algorithm 1 Representation learning using AVFs
input: k – desired number of AVFs, d – desired number of features.
  Sample δ_1, . . . , δ_k ∼ [−1, 1]^n
  Compute µ_i = arg max_π δ_i^⊤ V^π using a policy gradient method
  Find φ∗ = arg min_φ L(φ; {V^{µ_1}, . . . , V^{µ_k}}) (Equation 4)
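The toy sketch below runs Algorithm 1 end-to-end on the running tabular example. Two simplifications are ours rather than the paper's: the policy gradient step is replaced by brute-force search over deterministic policies (sufficient here, since AVFs are attained by deterministic policies per Lemma 1), and the representation loss (4) is minimized in closed form via Theorem 3 instead of by gradient descent.

```python
k_avf, d = 10, 3
det_values = []                                          # value function of every deterministic policy
for actions in itertools.product(range(a), repeat=n):
    P_det = P[np.arange(n), list(actions)]
    det_values.append(np.linalg.solve(np.eye(n) - 0.9 * P_det, r))

avf_targets = []
for _ in range(k_avf):
    delta_i = rng.uniform(-1.0, 1.0, size=n)                               # sample delta_i ~ [-1, 1]^n
    avf_targets.append(max(det_values, key=lambda V: float(delta_i @ V)))  # V^{mu_i} maximizing delta_i^T V
avf_targets = np.stack(avf_targets)                      # (k, n)

_, _, Vt = np.linalg.svd(avf_targets, full_matrices=False)
Phi_avf = Vt[:d].T                                       # (n, d) representation fitted to the AVF targets
```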

We perform all of our experiments within the four-room domain (Sutton et al., 1999; Solway et al., 2014; Machado et al., 2017; Figure 2, see also Appendix H.1).

[Figure 2 graphic: the four-room grid with its goal state, alongside panels for an interest function δ, the network flow d_π, the resulting adversarial value function, and its policy.]

Figure 2: Leftmost. The four-room domain. Other panels. An interest function δ, the network flow dπ, the corresponding adversarial value function (blue/red = low/high value) and its policy.

We consider a two-part approximation where we pretrain φ end-to-end to predict a set of value functions. Our aim here is to compare the effects of using different sets of value functions, including AVFs, on the learned representation. As our focus is on the efficient use of a d-dimensional representation (with d < n, the number of states), we encode individual states as one-hot vectors and map them into φ(x) without capacity constraints. Additional details may be found in Appendix H.

4.1 Adversarial Value Functions

Our first set of results studies the structure of adversarial value functions in the four-room domain. We generated interest functions by assigning a value δ(x) ∈ {−1, 0, 1} uniformly at random to each state x (Figure 2, left). We restricted δ to these discrete choices for illustrative purposes.

We then used model-based policy gradient (Sutton et al., 2000) to find the policy maximizing ∑x∈X δ(x)V π(x). We observed some local minima or accumulation points, but on the whole reasonable solutions were found. The resulting network flow and AVF for a particular sample are shown in Figure 2. For most states, the signs of δ and dπ agree; however, this is not true of all states (larger version and more examples in appendix, Figures 6, 7). As expected, states for which dπ > 0 (respectively, dπ < 0) correspond to states maximizing (resp. minimizing) the value function. Finally, we remark on the “flow” nature of dπ: trajectories over minimizing states accumulate in corners or loops, while those over maximizing states flow to the goal. We conclude that AVFs exhibit interesting structure, and are generated by policies that are not random (Figure 2, right). As we will see next, this is a key differentiator in making AVFs good auxiliary tasks.
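
To make these quantities concrete, the following is a minimal NumPy sketch that computes the network flow dπ = (I − γPπ>)−1δ and the objective δ>V π for a fixed policy; it assumes the policy's transition matrix, reward vector and interest function are available as arrays, and the function name is ours rather than from the paper's code.

```python
import numpy as np

def avf_quantities(P_pi, r, delta, gamma=0.99):
    """Given a policy's transition matrix P_pi (n x n), reward vector r (n,)
    and interest function delta (n,), return the value function V^pi,
    the network flow d_pi = (I - gamma P_pi^T)^{-1} delta, and the
    objective delta^T V^pi (equal to d_pi^T r by Theorem 2)."""
    n = P_pi.shape[0]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
    d = np.linalg.solve(np.eye(n) - gamma * P_pi.T, delta)
    return V, d, delta @ V

# Example on a random 5-state MDP with a random interest function.
rng = np.random.default_rng(0)
P_pi = rng.dirichlet(np.ones(5), size=5)      # rows sum to 1
r = rng.random(5)
delta = rng.choice([-1.0, 0.0, 1.0], size=5)  # discrete choices, as in this section
V, d, obj = avf_quantities(P_pi, r, delta, gamma=0.9)
assert np.isclose(obj, d @ r)                 # delta^T V^pi == d_pi^T r
```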

4.2 Representation Learning with AVFs

We next consider the representations that arise from training a deep network to predict AVFs (denoted AVF from here on). We sample k = 1000 interest functions and use Algorithm 1 to generate k AVFs.


We combine these AVFs into the representation loss (4) and adapt the parameters of the deep network using RMSProp (Tieleman and Hinton, 2012).

We contrast the AVF-driven representation with one learned by predicting the value function of random deterministic policies (RP). Specifically, these policies are generated by assigning an action uniformly at random to each state. We also consider the value function of the uniformly random policy (VALUE). While we make these choices here for concreteness, other experiments yielded similar results (e.g. predicting the value of the optimal policy; appendix, Figure 8). In all cases, we learn a d = 16 dimensional representation, not including the bias unit.
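
For illustration, here is a hedged sketch of how the VALUE and RP targets described above could be generated from a tabular model; the helper names and the (num_actions, n, n) layout of the transition kernel P are our assumptions, not the paper's code.

```python
import numpy as np

def value_of_policy(P_pi, r, gamma=0.99):
    """Exact value function V = (I - gamma P_pi)^{-1} r for a tabular policy."""
    return np.linalg.solve(np.eye(len(r)) - gamma * P_pi, r)

def baseline_targets(P, r, gamma=0.99, k=1000, seed=0):
    """VALUE target: value of the uniformly random policy.
    RP targets: values of k random deterministic policies, each obtained by
    assigning one action uniformly at random to every state."""
    rng = np.random.default_rng(seed)
    num_actions, n, _ = P.shape
    value_target = value_of_policy(P.mean(axis=0), r, gamma)
    rp_targets = []
    for _ in range(k):
        actions = rng.integers(num_actions, size=n)
        P_pi = P[actions, np.arange(n)]        # row x is P[a_x, x, :]
        rp_targets.append(value_of_policy(P_pi, r, gamma))
    return value_target, np.stack(rp_targets, axis=1)
```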


Figure 3: 16-dimensional representations learned by predicting a single value function, the value functions of 1000 random policies, or 1000 AVFs sampled using Algorithm 1. Each panel element depicts the activation of a given feature across states, with blue/red indicating low/high activation.

Figure 3 shows the representations learned by the three methods. The features learned by VALUE resemble the value function itself (top left feature) or its negated image (bottom left feature). Coarsely speaking, these features capture the general distance to the goal but little else. The features learned by RP are of even worse quality. This is because almost all random deterministic policies cause the agent to avoid the goal (appendix, Figure 12). The representation learned by AVF, on the other hand, captures the structure of the domain, including paths between distal states and focal points corresponding to rooms or parts of rooms.

Although our focus is on the use of AVFs as auxiliary tasks to a deep network, we observe the same results when discovering a representation using singular value decomposition (Section 3.2), as described in Appendix I. All in all, our results illustrate that, among all value functions, AVFs are particularly useful auxiliary tasks for representation learning.

4.3 Learning the Optimal Policy

Figure 4: Average discounted return achieved by policies learned using a representation produced by VALUE, AVF, or PVF. Average is over 20 random seeds and shading gives standard deviation.

In a final set of experiments, we consider learning a reward-maximizing policy using a pretrained representation and a model-based version of the SARSA algorithm (Rummery and Niranjan, 1994; Sutton and Barto, 1998). We compare the value-based and AVF-based representations from the previous section (VALUE and AVF), and also proto-value functions (PVF; details in Appendix H.3).

We report the quality of the learned policies after training, as a function of d, the size of the representation. Our quality measure is the average return from the designated start state (bottom left). Results are provided in Figure 4 and Figure 13 (appendix). We observe a failure of the VALUE representation to provide a useful basis for learning a good policy, even as d increases; while the representation is not rank-deficient, the features do not help reduce the approximation error.


In comparison, our AVF representations perform similarly to PVFs. Increasing the number of auxiliary tasks also leads to better representations; recall that PVF implicitly uses n = 104 auxiliary tasks.

5 Related Work

Our work takes inspiration from earlier research in basis or feature construction for reinforcement learning. Ratitch and Precup (2004), Foster and Dayan (2002), Menache et al. (2005), Yu and Bertsekas (2009), Bhatnagar et al. (2013), and Song et al. (2016) consider methods for adapting parametrized basis functions using iterative schemes. Including Mahadevan and Maggioni (2007)’s proto-value functions, a number of works (we note Dayan, 1993; Petrik, 2007; Mahadevan and Liu, 2010; Ruan et al., 2015; Barreto et al., 2017) have used characteristics of the transition structure of the MDP to generate representations; these are the closest in spirit to our approach, although none use the reward or consider the geometry of the space of value functions. Parr et al. (2007) proposed constructing a representation from successive Bellman errors, Keller et al. (2006) used dimensionality reduction methods; finally Hutter (2009) proposes a universal scheme for selecting representations.

Deep reinforcement learning algorithms have made extensive use of auxiliary tasks to improve agent performance, beginning perhaps with universal value function approximators (Schaul et al., 2015) and the UNREAL architecture (Jaderberg et al., 2017); see also Dosovitskiy and Koltun (2017), Francois-Lavet et al. (2018) and, more tangentially, van den Oord et al. (2018). Levine et al. (2017) and Chung et al. (2019) make explicit use of two-part approximations to derive more sample efficient deep reinforcement learning algorithms. The notion of augmenting an agent with side predictions regarding the world is not new, with roots in TD models (Sutton, 1995), predictive state representations (Littman et al., 2002) and the Horde architecture (Sutton et al., 2011), which itself is based on Selfridge (1959)’s Pandemonium architecture.

In passing, we remark on a number of works which aim to quantify or explain the usefulness of a representation. Parr et al. (2008) studies the particular case of linear representations, while Li et al. (2006); Abel et al. (2016) consider the approximation error that arises from state abstraction. More recently, Nachum et al. (2019) provide some interesting guarantees in the context of hierarchical reinforcement learning, while Such et al. (2019) visualizes the representations learned by Atari-playing agents. Finally, Bertsekas (2018) remarks on the two-part approximation we study here.

6 Conclusion

In this paper we studied the notion of an adversarial value function, derived from a geometric perspective on representation learning in RL. Our work shows that adversarial value functions exhibit interesting structure, and are good auxiliary tasks when learning a representation of an environment. We believe our work to be the first to provide formal evidence as to the usefulness of predicting value functions for shaping an agent’s representation.

Our work opens up the possibility of automatically generating auxiliary tasks in deep reinforcement learning, analogous to how deep learning itself enabled a move away from hand-crafted features. A number of practical considerations remain to be addressed. First, our sampling procedure is clearly inefficient, and may be improved by encouraging diversity within AVFs. Second, practical implementations require learning the AVFs concurrently with the main task. Doing so results in off-policy learning, whose negative effects are well-documented even in recent applications (e.g. van Hasselt et al., 2018). Third, interest functions in large domains should incorporate some degree of smoothness, rather than vary rapidly from state to state.

From a mathematical perspective, our formulation of the representation learning problem (1) was made with both convenience and geometry in mind. Conceptually, it may be interesting to consider our approach in other norms, including the weighted norms used in approximation results.

7 Acknowledgements

The authors thank the many people who helped shape this project through discussions and feedback on early drafts: Lihong Li, George Tucker, Doina Precup, Ofir Nachum, Csaba Szepesvari, Georg Ostrovski, Marek Petrik, Marlos Machado, Tim Lillicrap, Danny Tarlow, Saurabh Kumar, and Carles Gelada. Special thanks also to Philip Thomas and Scott Niekum, who gave this project its initial impetus.

8 Author Contributions

M.G.B., W.D., D.S., and N.L.R. conceptualized the representation learning problem. M.G.B., W.D., T.L., A.A.T., R.D., D.S., and N.L.R. contributed to the theoretical results. M.G.B., W.D., P.S.C., R.D., and C.L. performed experiments and collated results. All authors contributed to the writing.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). Tensorflow: A system for large-scale machine learning. In Symposium on Operating Systems Design and Implementation.

Abel, D., Hershkowitz, D. E., and Littman, M. L. (2016). Near optimal behavior via approximate state abstraction. In Proceedings of the International Conference on Machine Learning.

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems.

Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. (2017). Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems.

Behzadian, B. and Petrik, M. (2018). Feature selection by singular value decomposition for reinforcement learning. In Proceedings of the ICML Prediction and Generative Modeling Workshop.

Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.

Bellman, R. E. (1957). Dynamic programming. Princeton University Press, Princeton, NJ.

Bernhard, K. and Vygen, J. (2008). Combinatorial optimization: Theory and algorithms. Springer, Third Edition, 2005.

Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming. Athena Scientific.

Bertsekas, D. P. (2018). Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. Technical report, MIT/LIDS.

Bhatnagar, S., Borkar, V. S., and Prabuchandran, K. (2013). Feature search in the Grassmannian in online reinforcement learning. IEEE Journal of Selected Topics in Signal Processing.

Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. (2018). Dopamine: A research framework for deep reinforcement learning. arXiv.

Chung, W., Nath, S., Joseph, A. G., and White, M. (2019). Two-timescale networks for nonlinear value function approximation. In International Conference on Learning Representations.

Dadashi, R., Taiga, A. A., Roux, N. L., Schuurmans, D., and Bellemare, M. G. (2019). The value function polytope in reinforcement learning. arXiv.

Dayan, P. (1993). Improving generalisation for temporal difference learning: The successor representation. Neural Computation.


Dosovitskiy, A. and Koltun, V. (2017). Learning to act by predicting the future. In Proceedings of the International Conference on Learning Representations.

Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556.

Farahmand, A., Ghavamzadeh, M., Szepesvari, C., and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research.

Foster, D. and Dayan, P. (2002). Structure in the space of value functions. Machine Learning.

Francois-Lavet, V., Bengio, Y., Precup, D., and Pineau, J. (2018). Combined reinforcement learning via abstract representations. arXiv.

Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. (2019). DeepMDP: Learning continuous latent space models for representation learning. In Proceedings of the International Conference on Machine Learning.

Hutter, M. (2009). Feature reinforcement learning: Part I. Unstructured MDPs. Journal of Artificial General Intelligence.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of the International Conference on Learning Representations.

Keller, P. W., Mannor, S., and Precup, D. (2006). Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Lagoudakis, M. and Parr, R. (2003). Least-squares policy iteration. The Journal of Machine Learning Research.

Levine, N., Zahavy, T., Mankowitz, D., Tamar, A., and Mannor, S. (2017). Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems.

Li, L., Walsh, T., and Littman, M. (2006). Towards a unified theory of state abstraction for MDPs. In Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics.

Liang, Y., Machado, M. C., Talvitie, E., and Bowling, M. H. (2016). State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems.

Littman, M. L., Sutton, R. S., and Singh, S. (2002). Predictive representations of state. In Advances in Neural Information Processing Systems.

Machado, M. C., Bellemare, M. G., and Bowling, M. (2017). A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Machado, M. C., Rosenbaum, C., Guo, X., Liu, M., Tesauro, G., and Campbell, M. (2018). Eigenoption discovery through the deep successor representation. In Proceedings of the International Conference on Learning Representations.

Mahadevan, S. and Liu, B. (2010). Basis construction from power series expansions of value functions. In Advances in Neural Information Processing Systems.

Mahadevan, S. and Maggioni, M. (2007). Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research.

Menache, I., Mannor, S., and Shimkin, N. (2005). Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.


Munos, R. (2003). Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning.

Munos, R. (2007). Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization.

Nachum, O., Gu, S., Lee, H., and Levine, S. (2019). Near-optimal representation learning for hierarchical reinforcement learning. In Proceedings of the International Conference on Learning Representations.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Parr, R., Painter-Wakefield, C., Li, L., and Littman, M. (2007). Analyzing feature generation for value-function approximation. In Proceedings of the International Conference on Machine Learning.

Petrik, M. (2007). An analysis of Laplacian methods for value function approximation in MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence.

Petrik, M. and Zilberstein, S. (2011). Robust approximate bilinear programming for value function approximation. Journal of Machine Learning Research.

Puterman, M. L. (1994). Markov Decision Processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc.

Ratitch, B. and Precup, D. (2004). Sparse distributed memories for on-line value-based reinforcement learning. In Proceedings of the European Conference on Machine Learning.

Rockafellar, R. T. and Wets, R. J.-B. (2009). Variational analysis. Springer Science & Business Media.

Ruan, S. S., Comanici, G., Panangaden, P., and Precup, D. (2015). Representation discovery for MDPs using bisimulation metrics. In Proceedings of the AAAI Conference on Artificial Intelligence.

Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical report, Cambridge University Engineering Department.

Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal value function approximators. In Proceedings of the International Conference on Machine Learning.

Selfridge, O. (1959). Pandemonium: A paradigm for learning. In Symposium on the mechanization of thought processes.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.

Solway, A., Diuk, C., Cordova, N., Yee, D., Barto, A. G., Niv, Y., and Botvinick, M. M. (2014). Optimal behavioral hierarchy. PLOS Computational Biology.

Song, Z., Parr, R., Liao, X., and Carin, L. (2016). Linear feature encoding for reinforcement learning. In Advances in Neural Information Processing Systems.

Stachenfeld, K. L., Botvinick, M., and Gershman, S. J. (2014). Design principles of the hippocampal cognitive map. In Advances in Neural Information Processing Systems.

Such, F. P., Madhavan, V., Liu, R., Wang, R., Castro, P. S., Li, Y., Schubert, L., Bellemare, M. G., Clune, J., and Lehman, J. (2019). An Atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning agents. In Proceedings of the International Joint Conference on Artificial Intelligence.


Sutton, R., Modayil, J., Delp, M., Degris, T., Pilarski, P., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems.

Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Proceedings of the International Conference on Machine Learning.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.

Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.

Sutton, R. S., Precup, D., and Singh, S. P. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3).

Tieleman, T. and Hinton, G. (2012). RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.

Tosatto, S., Pirotta, M., D’Eramo, C., and Restelli, M. (2017). Boosted fitted Q-iteration. In Proceedings of the International Conference on Machine Learning.

van den Oord, A., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. In Advances in Neural Information Processing Systems.

van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. (2018). Deep reinforcement learning and the deadly triad. arXiv.

Wu, Y., Tucker, G., and Nachum, O. (2019). The Laplacian in RL: Learning representations with efficient approximations. In Proceedings of the International Conference on Learning Representations (to appear).

Yu, H. and Bertsekas, D. P. (2009). Basis function adaptation methods for cost approximation in MDP. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.


A Proof of Lemma 1

Consider the value polytope V. We have, using Corollary 1 of Dadashi et al. (2019), that

V ⊆ CONV(V) = CONV(V π1 , . . . , V πm), (5)

where π1, . . . , πm is a finite collection of deterministic policies. We assume that this set of policies is of minimal cardinality, i.e. the value functions V π1 , . . . , V πm are distinct.

The optimization problem maxV ∈V δ>V is equivalent to the linear program maxV ∈CONV(V) δ>V , and the maximum is reached at a vertex U of the convex hull of V (Boyd and Vandenberghe, 2004). By (5), U is the value function of a deterministic policy. Now consider δ ∈ Rn such that fδ attains its maximum over multiple elements of the convex hull. By hypothesis, there must be two policies πi, πj such that V πi ≠ V πj and

maxV ∈V δ>V = δ>V πi = δ>V πj ,

and thus

δ>(V πi − V πj ) = 0. (6)

Write ∆ for the ensemble of such δ. We have from (6):

∆ ⊆ ⋃1≤i<j≤m {δ ∈ Rn | δ>(V πi − V πj ) = 0}.

As V π1 , . . . , V πm are distinct, ∆ is included in a finite union of hyperplanes (recall that hyperplanes of Rn are vector spaces of dimension n − 1). The Lebesgue measure of a hyperplane is 0 (in Rn), hence a finite union of hyperplanes also has Lebesgue measure 0. Hence ∆ itself has Lebesgue measure 0 in Rn.

B Proof of Corollary 2

Similarly to the proof of Lemma 1, we introduce V π1 , . . . , V πm which are the distinct vertices of the convex hull of the value polytope V. Note that π1, . . . , πm are deterministic policies. We shall show that there are at most 2^n such vertices.

Recall the definition of a cone in Rn: C is a cone in Rn if ∀v ∈ C, ∀α ≥ 0, αv ∈ C. For each vertex V πi , Rockafellar and Wets (2009, Theorem 6.12) states that there is an associated cone Ci of nonzero Lebesgue measure in Rn such that

∀δ ∈ Ci, arg maxV ∈V δ>V = V πi .

Now using Theorem 2, we have

maxV ∈V δ>V = maxπ∈P d>π r, where dπ = (I − γPπ>)−1δ.

For all δ ∈ Ci the corresponding policy πi is the same (by hypothesis). For such a δ, define dπi,δ := (I − γPπi>)−1δ, such that

δ>V πi = d>πi,δ r.

Because Ci is a cone of nonzero Lebesgue measure in Rn, we have span(Ci) = Rn. Combined with the fact that (I − γPπi>)−1 is full rank, this implies we can find a direction δi in Ci for which dπi,δi(x) ≠ 0 for all x ∈ X . For this δi, using Theorem 2 we have:

V πi(x) = r(x) + γ maxa∈A Ex′∼P V πi(x′)  if dπi,δi(x) > 0,
V πi(x) = r(x) + γ mina∈A Ex′∼P V πi(x′)  if dπi,δi(x) < 0,    (7)

and each state is “strictly” a maximizer or minimizer (the purpose of our cone argument was to avoid the undefined case where dπi,δi(x) = 0). Now define σi ∈ {−1, 1}n, σi(x) = sign(dπi,δi(x)). We have:

V πi(x) = r(x) + γσi(x) maxa∈A σi(x) Ex′∼P V πi(x′) = TσiV πi(x)

where TσV (x) = r(x) + γσ(x) maxa∈A σ(x) Ex′∼P V (x′) for σ ∈ {−1, 1}n. We show that Tσ is a contraction mapping: for any x ∈ X and σ ∈ {−1, 1}n,

|TσV1(x) − TσV2(x)| = |r(x) + γσ(x) maxa∈A σ(x) Ex′∼P V1(x′) − r(x) − γσ(x) maxa∈A σ(x) Ex′∼P V2(x′)|
= γ |maxa∈A σ(x) Ex′∼P V1(x′) − maxa∈A σ(x) Ex′∼P V2(x′)|
≤ γ maxa∈A |σ(x) Ex′∼P V1(x′) − σ(x) Ex′∼P V2(x′)|
≤ γ maxa∈A maxx′∈X |V1(x′) − V2(x′)|
= γ maxx′∈X |V1(x′) − V2(x′)|.

Therefore, ‖TσV1 − TσV2‖∞ ≤ γ‖V1 − V2‖∞ and Tσ is a γ-contraction in the supremum norm. By Banach’s fixed point theorem, V πi is its unique fixed point.

We showed that each vertex V πi of the value function polytope V is the fixed point of an operator Tσi . Since there are 2^n such operators, there are at most 2^n vertices.
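
As an illustration of the operators Tσ used above, the following sketch iterates one such operator to its fixed point on a small random MDP; the array layout and function name are our assumptions, not part of the proof.

```python
import numpy as np

def apply_T_sigma(V, P, r, sigma, gamma=0.9):
    """One application of T_sigma: (T_sigma V)(x) = r(x) + gamma * sigma(x) *
    max_a sigma(x) * E_{x'~P(.|x,a)} V(x'), i.e. a max over actions where
    sigma(x)=+1 and a min where sigma(x)=-1. P has shape (num_actions, n, n)."""
    EV = P @ V                                     # (num_actions, n): expected next value
    best = np.where(sigma > 0, EV.max(axis=0), EV.min(axis=0))
    return r + gamma * best

# Iterate to the unique fixed point (T_sigma is a gamma-contraction).
rng = np.random.default_rng(1)
n, num_actions, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(num_actions, n))
r = rng.random(n)
sigma = rng.choice([-1.0, 1.0], size=n)
V = np.zeros(n)
for _ in range(500):
    V = apply_T_sigma(V, P, r, sigma, gamma)
```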

C Proof of Theorem 1

We will show that the maximization over P is the same as the maximization over Pv .

Let Πφ be the projection matrix onto the hyperplane H spanned by the basis induced by φ. We write

‖V πφ − V π‖²₂ = ‖ΠφV π − V π‖²₂ = ‖(Πφ − I)V π‖²₂ = V π>(Πφ − I)>(Πφ − I)V π = V π>(I − Πφ)V π

because Πφ is idempotent. The eigenvalues of A = I − Πφ are 1 and 0, and the eigenvectors corresponding to eigenvalue 1 are normal to H. Because we are otherwise free to choose any basis spanning the subspace normal to H, there is a unit vector δ normal to H for which

maxπ ‖V πφ − V π‖²₂ = maxV π∈V ‖V πφ − V π‖²₂ = maxV π∈V V π>δδ>V π.

Denote the value function maximizing this quantity by V πMAX. This δ can be chosen so that δ>V πMAX > 0 (if not, take δ′ = −δ). Then V πMAX is also the maximizer of f(V ) := δ>V over V, and Lemma 1 tells us that V πMAX is an extremal vertex.

D Proof of Theorem 2

To begin, note that

δ>V π = δ>(I − γPπ)−1r = ((I − γPπ>)−1δ)>r = d>π r,

as required.

Now, we choose an indexing for states in S and will refer to states by their index.

Let π be the policy maximizing δ>V π and consider some x∗ ∈ S. We assume without loss of generality that x∗ is the first state in this ordering. Recall that n = |S|.

The theorem states that policy π chooses the highest-valued action at x∗ if dπ(x∗) > 0, and the lowest-valued action if dπ(x∗) < 0. Writing Pπx∗ := Pπ(· |x∗) for conciseness, this is equivalent to

r(x∗) + Ex′∼Pπx∗ V π(x′) = maxπ′ [ r(x∗) + Ex′∼Pπ′x∗ V π(x′) ],

for dπ(x∗) > 0, and conversely with a minπ′ for dπ(x∗) < 0 (equivalently, T πV π(x∗) ≥ T π′V π(x∗) for all π′ ∈ P, or T πV π(x∗) ≤ T π′V π(x∗), in operator notation).

We write the transition matrix Pπ row by row: its i-th row is Lπi = (Pπ(x1 |xi), · · · , Pπ(xn |xi)).

Then we express the transition matrix as Pπ = Aπ + Bπ, where Aπ is Pπ with its first row replaced by zeros, and Bπ is the matrix whose first row is Lπ1 and whose other rows are zero.

We can then write

V π = r + γPπV π = r + γ(Aπ + Bπ)V π  ⇒  V π = (I − γAπ)−1(r + γBπV π).

This is an application of matrix splitting (e.g. Puterman, 1994). The invertibility of (I − γAπ) is guaranteed because Aπ is a substochastic matrix. The first term of the r.h.s. corresponds to the expected sum of discounted rewards when following π until reaching x∗, while the second term is the expected sum of discounted rewards received after leaving from x∗ and following policy π.

Note that (I − γAπ)−1 does not depend on π(· |x∗) and that

BπV π = (Ex′∼Pπx∗ V π(x′), 0, . . . , 0)>.

Write Cπ = (I − γAπ>)−1δ. We have

δ>V π = δ>(I − γAπ)−1(r + γBπV π) = Cπ>(r + γBπV π) = Cπ>r + γCπ(x∗) Ex′∼Pπx∗ V π(x′).

Now by assumption,

δ>V π ≥ δ>V π′   (8)

for any other policy π′ ∈ P. Take π′ such that π′(· |x) = π(· |x) everywhere but x∗; then Cπ = Cπ′ and (8) implies that

Cπ(x∗) Ex′∼Pπx∗ V π(x′) ≥ Cπ(x∗) Ex′∼Pπ′x∗ V π(x′).

Hence π must pick the maximizing action in x∗ if Cπ(x∗) > 0, and the minimizing action if Cπ(x∗) < 0.

To conclude the proof, we show that dπ(x∗) and Cπ(x∗) have the same sign. We write

dπ = δ + γ(Aπ> +Bπ>)dπ.

Then

(I − γAπ>)dπ = δ + γBπ>dπ
⇒ dπ = Cπ + γ(I − γAπ>)−1Bπ>dπ
= ∑k≥0 (γ(I − γAπ>)−1Bπ>)k Cπ
= ∑k≥0 γk (Dπ>)k Cπ,

where Dπ = Bπ(I − γAπ)−1 is a matrix with non-negative components. Because Bπ is sparse, every row of (Dπ)k is null except for the first one, which we write (dk11, . . . , dk1n) with dk1i ≥ 0 for all i. Consequently,

dπ(x∗) = ( ∑k≥0 γk dk11 ) Cπ(x∗).

Hence Cπ(x∗) and dπ(x∗) have the same sign.

E Proof of Theorem 3

We first transform (3) into an equivalent problem. Let V ∈ Rn, and denote by Vφ := ΠφV the orthogonal projection of V onto the subspace spanned by the columns of Φ. From Pythagoras’ theorem we have, for any V ∈ Rn,

‖V‖²₂ = ‖Vφ − V‖²₂ + ‖Vφ‖²₂.

Then

minφ∈R EV∼ξ ‖Vφ − V‖²₂ = minφ∈R EV∼ξ [ ‖V‖²₂ − ‖ΠφV‖²₂ ] = maxφ∈R EV∼ξ ‖ΠφV‖²₂.

Let u∗1, . . . , u∗d be the principal components defined in Theorem 3. These form an orthonormal basis. Hence u∗1, . . . , u∗d is equivalently a solution of

maxu1,...,ud∈Rn orthonormal ∑i=1..d EV∼ξ (u>i V)² = maxu1,...,ud∈Rn orthonormal ∑i=1..d EV∼ξ ‖(u>i V) ui‖²₂
= maxu1,...,ud∈Rn orthonormal EV∼ξ ‖∑i=1..d (u>i V) ui‖²₂
= maxΦ=[u1,...,ud], Φ>Φ=I EV∼ξ ‖ΦΦ>V‖²₂
= maxφ∈R EV∼ξ ‖ΠφV‖²₂.

This gives the desired result. The equivalence with the eigenvectors of Eξ[V V>] follows from writing

EV∼ξ (u>V)² = EV∼ξ u>V V>u = u> EV∼ξ[V V>] u

and appealing to a Rayleigh quotient argument, since we require u∗i to be of unit norm.
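
A small numerical check of this equivalence, under the simplifying assumption that the value functions are sampled at random for illustration: the top-d eigenvectors of the empirical second moment E[V V>] should achieve the largest mean squared projection norm among orthonormal d-dimensional representations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, num_samples = 10, 3, 500
Vs = rng.normal(size=(n, num_samples))        # columns stand in for sampled value functions

second_moment = Vs @ Vs.T / num_samples       # empirical E[V V^T]
eigvals, eigvecs = np.linalg.eigh(second_moment)
Phi = eigvecs[:, -d:]                         # top-d eigenvectors (eigh sorts ascending)

def mean_projection_norm(Phi, Vs):
    proj = Phi @ (Phi.T @ Vs)                 # Pi_phi V for each sample
    return np.mean(np.sum(proj ** 2, axis=0))

# The principal subspace scores at least as high as a random orthonormal basis.
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
assert mean_projection_norm(Phi, Vs) >= mean_projection_norm(Q, Vs) - 1e-9
```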


F The Optimization Problem (1) as a Quadratic Program

Proposition 1. The optimization problem (1) is equivalent to a quadratic program with quadratic constraints.

Proof. For completeness, let n, d be the number of states and features, respectively. We consider representations Φ ∈ Rn×d. Recall that Πφ is the projection operator onto the subspace spanned by Φ, that is

Πφ = Φ(Φ>Φ)−1Φ>.

We will also write Pd for the space of deterministic policies. We write (1) in epigraph form (Boyd and Vandenberghe, 2004):

min. maxπ ‖ΠφV π − V π‖²₂  ⇔  min. maxπ∈Pd ‖ΠφV π − V π‖²₂  ⇔  min. t s.t. ‖ΠφV π − V π‖²₂ ≤ t ∀π ∈ Pd.

The first equivalence comes from the fact that the extremal vertices of our polytope are achieved by deterministic policies. The norm in the constraint can be written as

‖ΠφV π − V π‖²₂ = ‖(Πφ − I)V π‖²₂
= V π>(Πφ − I)>(Πφ − I)V π
(a)= V π>(Π²φ − 2Πφ + I)V π
(b)= V π>(I − Πφ)V π,

where (a) and (b) follow from the idempotency of Πφ. This is

‖ΠφV π − V π‖²₂ = V π>(I − Φ(Φ>Φ)−1Φ>)V π.

To make the constraint quadratic, we further require that the representation be left-orthogonal: Φ>Φ = I. Hence the optimization problem (1) is equivalent to

minimize t  s.t.  V π>(I − ΦΦ>)V π ≤ t ∀π ∈ Pd,  Φ>Φ = I.

From inspection, these constraints are quadratic.

However, there are an exponential number of deterministic policies and hence an exponential number of constraints in our optimization problem.

G NP-hardness of Finding AVFs

Proposition 2. Finding maxπ∈Pd δ>V π is NP-hard, where the input is a deterministic MDP with binary-valued reward function, discount rate γ = 1/2 and δ : X → {−1/4, 0, 1}.

We use a reduction from the optimization version of minimum set cover, which is known to be NP-hard (Bernhard and Vygen, 2008, Corollary 15.24). Let n and m be natural numbers. An instance of set cover is a collection of sets C = {C1, . . . , Cm} where Ci ⊆ [n] = {1, 2, . . . , n} for all i ∈ [m]. The minimum set cover problem is

minJ⊆[m] { |J| : ⋃j∈J Cj = [n] }.

Given a Markov decision process 〈X , A, r, P, γ〉 and function δ : X → [−1, 1], define

R(π) = ∑x∈X δ(x)V π(x).

We are interested in the optimization problem

maxπ∈Pd R(π).   (9)

When δ(x) ≥ 0 for all x this corresponds to finding the usual optimal policy, which can be found efficiently using dynamic programming. The proposition claims that more generally the problem is NP-hard.

Consider an instance of set cover C = {C1, . . . , Cm} over universe [n] with m > 1. Define a deterministic MDP 〈X , A, r, P, γ〉 with γ = 1/2 and n + m + 2 states and at most m actions. The state space is X = X1 ∪ X2 ∪ X3 where

X1 = {u1, . . . , un},  X2 = {v1, . . . , vm},  X3 = {g, b}.

The reward function is r(x) = I[x=g]. The transition function in a deterministic MDP is characterized by a function mapping states to the set of possible next states:

N(x) = ⋃a∈A {x′ : P(x′ |x, a) = 1}.

We use C to choose P as a deterministic transition function for which

N(x) = {x} if x ∈ X3,   N(x) = {g, b} if x ∈ X2,   N(x) = {vj : i ∈ Cj} if x = ui ∈ X1.

This means the states in X3 are self-transitioning and states in X2 have transitions leading to either state in X3. States in X1 transition to states in X2 in a way that depends on the set cover instance. The situation is illustrated in Figure 5. Since both policies and the MDP are deterministic, we can represent a policy as a function π : X → X for which π(x) ∈ N(x) for all x ∈ X . To see the connection to set cover, notice that

⋃vj∈π(X1) Cj = [n],   (10)

where π(X1) = {π(x) : x ∈ X1}. Define

δ(x) = 1 if x ∈ X1,   δ(x) = −1/4 if x ∈ X2,   δ(x) = 0 if x ∈ X3.

Using the definition of the value function and MDP,

R(π) = ∑x∈X δ(x)V π(x)
= ∑x∈X1 V π(x) − (1/4) ∑x∈X2 V π(x)
= ∑x∈X1 V π(x) − (1/4) ∑x∈X2 I[π(x)=g]
= (1/2) ∑x∈X1 I[π(π(x))=g] − (1/4) ∑x∈X2 I[π(x)=g].

The decomposition shows that any policy maximizing (9) must satisfy π(π(X1)) = {g} and π(X2 \ π(X1)) = {b}, and for such policies

R(π) = (1/2) ( n − (1/2)|π(X1)| ).

In other words, a policy maximizing (9) minimizes |π(X1)|, which by (10) corresponds to finding a minimum set cover. Rearranging shows that

minJ⊆[m] { |J| : ⋃j∈J Cj = [n] } = 2n − 4 maxπ∈Pd R(π).

The result follows by noting this reduction is clearly polynomial time.


Figure 5: The challenging MDP given set cover problem {{1, 2, 3}, {1, 4}, {4}}. State g gives a reward of 1 and all other states give reward 0. The optimal policy is to find the smallest subset of the middle layer such that for every state in the bottom layer there exists a transition to the subset.
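
For concreteness, the following is a short sketch of the reduction for the instance of Figure 5; the dictionary-based encoding and all names are ours, and only the construction from the proof (successor sets N(x), rewards, and δ) is spelled out.

```python
# Set cover instance C = {{1,2,3}, {1,4}, {4}} over universe [4], as in Figure 5.
C = [{1, 2, 3}, {1, 4}, {4}]
n, m = 4, len(C)

X1 = [f"u{i}" for i in range(1, n + 1)]
X2 = [f"v{j}" for j in range(1, m + 1)]
X3 = ["g", "b"]

def successors(x):
    """The set N(x) of possible next states from the construction above."""
    if x in X3:
        return {x}                                      # self-transitioning
    if x in X2:
        return {"g", "b"}
    i = int(x[1:])                                      # x = u_i
    return {f"v{j}" for j, Cj in enumerate(C, 1) if i in Cj}

reward = {x: float(x == "g") for x in X1 + X2 + X3}     # r(x) = I[x = g]
delta = {**{x: 1.0 for x in X1},
         **{x: -0.25 for x in X2},
         **{x: 0.0 for x in X3}}
```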

H Empirical Studies: Methodology

H.1 Four-room Domain

The four-room domain consists of 104 discrete states arranged into four “rooms”. There are four actions available to the agent, transitioning deterministically from one square to the next; when attempting to move into a wall, the agent remains in place. In our experiments, the top right state is a goal state, yielding a reward of 1 and terminating the episode; all other transitions have 0 reward.

H.2 Learning φ

Our representation φ consists of a single hidden layer of 512 rectified linear units (ReLUs) followed by a layer of d ReLUs which form our learned features. The use of ReLUs has an interesting side effect that all features are nonnegative, but other experiments with linear transforms yielded qualitatively similar results. The input is a one-hot encoding of the state (a 104-dimensional vector). All layers (and generally speaking, experiments) also included a bias unit.

The representation was learned using standard deep reinforcement learning tools taken from the Dopamine framework (Castro et al., 2018). Our loss function is the mean squared loss w.r.t. the targets, i.e. the AVFs or the usual value function. The losses were then trained using RMSProp with a step size of 0.00025 (the default optimizer from Dopamine), for 200,000 training updates each over a minibatch of size 32; empirically, we found our results robust to small changes in step sizes.

In our experiments we optimize both parts of the two-part approximation defined by φ and θ simultaneously, with each prediction made as a linear combination of features φ(x)>θi and replacing L(φ; µ) from (4) with a sample-based estimate. This leads to a slightly different optimization procedure but with similar representational characteristics.
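
As a simplified stand-in for this two-part optimization, the sketch below evaluates the sample-based representation loss by solving the linear part in closed form rather than jointly with RMSProp as described above; the function name and signature are our assumptions.

```python
import numpy as np

def representation_loss(Phi, targets):
    """Sample-based analogue of the loss in (4): for each target value function u
    (a column of `targets`, one per auxiliary task), fit the linear head theta_i by
    least squares and average the squared approximation errors. Phi has shape (n, d)."""
    theta, _, _, _ = np.linalg.lstsq(Phi, targets, rcond=None)
    residuals = Phi @ theta - targets
    return np.mean(np.sum(residuals ** 2, axis=0))
```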

H.3 Implementation Details: Proto-Value Functions

Our PVF representation consists of the top k left-singular vectors of the successor representation (I − γPπ)−1 for π the uniformly random policy, as suggested by Machado et al. (2018) and Behzadian and Petrik (2018). See Figure 9 for an illustration.
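
A minimal sketch of this construction, assuming the uniform-policy transition matrix is available in tabular form; the function name is ours.

```python
import numpy as np

def pvf_features(P_uniform, gamma, k):
    """Proto-value-function-style features: top-k left-singular vectors of the
    successor representation (I - gamma * P_pi)^{-1} for the uniformly random policy."""
    n = P_uniform.shape[0]
    sr = np.linalg.inv(np.eye(n) - gamma * P_uniform)
    U, _, _ = np.linalg.svd(sr)
    return U[:, :k]          # row x is the k-dimensional feature vector of state x
```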


H.4 Learning AVFs

The AVFs were learned from 1000 policy gradient steps, which were in general sufficient for convergence to an almost-deterministic policy. This policy gradient scheme was defined by directly writing the matrix (I − γPπ)−1 as a TensorFlow op (Abadi et al., 2016) and minimizing −δ>(I − γPπ)−1r w.r.t. π. We did not use an entropy penalty. In this case, there is no approximation: the AVF policies are directly represented as matrices of parameters of softmax policies.
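
The sketch below mirrors this setup in plain NumPy: it evaluates the exact objective δ>(I − γPπ)−1r for a softmax policy and ascends it with finite-difference gradients rather than TensorFlow autodiff, so it is an illustrative simplification, not the paper's implementation; all names are ours.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def avf_objective(logits, P, r, delta, gamma):
    """delta^T (I - gamma P_pi)^{-1} r for the softmax policy given by `logits`.
    P has shape (num_actions, n, n)."""
    pi = softmax(logits)                               # (n, num_actions)
    P_pi = np.einsum('xa,axy->xy', pi, P)
    V = np.linalg.solve(np.eye(len(r)) - gamma * P_pi, r)
    return delta @ V

def learn_avf(P, r, delta, gamma=0.99, steps=200, lr=1.0, eps=1e-5):
    """Illustrative gradient ascent on the policy logits via finite differences."""
    num_actions, n, _ = P.shape
    logits = np.zeros((n, num_actions))
    for _ in range(steps):
        base = avf_objective(logits, P, r, delta, gamma)
        grad = np.zeros_like(logits)
        for idx in np.ndindex(*logits.shape):
            bumped = logits.copy()
            bumped[idx] += eps
            grad[idx] = (avf_objective(bumped, P, r, delta, gamma) - base) / eps
        logits += lr * grad
    return softmax(logits), avf_objective(logits, P, r, delta, gamma)
```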

H.5 SARSA

In early experiments we found LSPI and fitted value iteration to be somewhat unstable, and eventually converged on a relatively robust, model-based variant of SARSA.

In all cases, we define the following dynamics. We maintain an occupancy vector d over the state space. At each time step we update this occupancy vector by applying one transition in the environment according to the current policy π, but also mix in a probability of resetting to a state uniformly at random in the environment:

d = 0.99 dPπ + 0.01 Unif(X ).

The policy itself is an ε-greedy policy according to the current Q-function, with ε = 0.1.

We update the Q-function using a semi-gradient update rule based on expected SARSA (Sutton and Barto, 1998), but where we simultaneously compute updates across all states and weight them according to the occupancy d. We use a common step size of 0.01 but premultiply the updates by the pseudoinverse of Φ>Φ to deal with variable feature shapes across methods. This process was applied for 50,000 training steps, after which we report performance as the average value and/or number of steps to goal for the 10 last recorded policies (at intervals of 100 steps each).
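
A hedged sketch of one synchronous update of this scheme; the array shapes, parameter names, and the uniform-restart constant written as 0.01/n are our assumptions rather than the exact implementation.

```python
import numpy as np

def expected_sarsa_step(theta, Phi, P, r, d, gamma=0.99, eps=0.1, alpha=0.01):
    """One model-based expected-SARSA update across all states, weighted by occupancy d.
    theta: (num_features, num_actions) weights with Q = Phi @ theta.
    Phi: (n, num_features) features, P: (num_actions, n, n) kernel, r: (n,) rewards."""
    n, num_actions = Phi.shape[0], theta.shape[1]
    Q = Phi @ theta                                        # (n, num_actions)
    greedy = Q.argmax(axis=1)
    pi = np.full((n, num_actions), eps / num_actions)
    pi[np.arange(n), greedy] += 1.0 - eps                  # epsilon-greedy policy
    EV = (pi * Q).sum(axis=1)                              # expected next-state value
    targets = r[:, None] + gamma * (P @ EV).T              # (n, num_actions)
    errors = d[:, None] * (targets - Q)                    # occupancy-weighted TD errors
    precond = np.linalg.pinv(Phi.T @ Phi)                  # pseudoinverse preconditioning
    theta = theta + alpha * precond @ Phi.T @ errors
    # occupancy update: one step under pi, mixed with a uniform restart
    P_pi = np.einsum('xa,axy->xy', pi, P)
    d = 0.99 * (d @ P_pi) + 0.01 / n
    return theta, d
```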

Overall, we found this learning scheme to reduce experimental variance and to be robust to off-policy divergence, which we otherwise observed in a number of experiments involving value-only representations.

I Representations as Principal Components of Sets of Value Functions

In the main text we focused on the use of value functions as auxiliary tasks, which are combined into the representation loss (4). However, Section 3.2 shows that doing so is equivalent (in intent) to computing the principal components of a particular set of value functions, where each “column” corresponds to a particular auxiliary task.

In Figure 10 we show the representations generated from this process, using different sets of value functions. For completeness, we consider:

• 1000 AVFs,
• 1000 random deterministic policies (RDPs),
• 1000 random stochastic policies (RSPs), and
• the 104 rows of the successor matrix (corresponding to proto-value functions).

As with principal component analysis, the per-state feature activations are determined up to a signed scalar; we pick the vector which has more positive components than negative. In all but the PVF case, we sample a subset of the many possible value functions within a particular set. Figure 11 shows that the AVF approach is relatively robust to the sample size.

The AVFs are sampled using Algorithm 1, i.e. by sampling a random interest function δ ∈ [−1, 1]n and using policy gradient on a softmax policy to find the corresponding value function. The random policies were generated by randomly initializing the same softmax policy and using them as-is (RSPs) or multiplying the logits by 1e6 (RDPs).
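
A small sketch of the RSP/RDP construction described above; the seed, state count, and helper are our assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
n, num_actions = 104, 4
logits = rng.normal(size=(n, num_actions))   # randomly initialized softmax policy

rsp = softmax(logits)          # random stochastic policy, used as-is
rdp = softmax(logits * 1e6)    # scaling the logits by 1e6 makes it (near-)deterministic
```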

Figure 6: Figure 2, enlarged. Red arrows highlight states where δ and d^π have opposite signs.

Figure 7: Four interest functions sampled from {−1, 1}^n, along with their corresponding flow d^π, adversarial value function, and corresponding policy. The top example was chosen to illustrate a scenario where d^π(x) < 0 but V^π(x) > 0; the other three were selected at random. In our experiments, sampling from [−1, 1]^n yielded qualitatively similar results.
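For concreteness, the tabular quantities shown in this figure can be computed in closed form once a policy is fixed. The sketch below assumes the standard definitions V^π = (I − γP_π)^{-1} r_π and d^π = (I − γP_π^⊤)^{-1} δ; the transition matrix, reward vector, state count, and discount are placeholders, and finding the policy that maximizes δ^⊤V^π (the AVF itself) is a harder problem that this sketch does not attempt.

```python
import numpy as np

def value_and_flow(P_pi, r_pi, delta, gamma=0.99):
    """Tabular V^pi and flow d^pi for a fixed policy pi and interest function delta.

    P_pi  : (n, n) state-to-state transition matrix under pi
    r_pi  : (n,)   expected reward under pi
    delta : (n,)   interest function, e.g. sampled from {-1, +1}^n
    gamma : discount factor (illustrative value, not necessarily the paper's)
    """
    n = P_pi.shape[0]
    A = np.eye(n) - gamma * P_pi
    v_pi = np.linalg.solve(A, r_pi)      # V^pi = (I - gamma P_pi)^{-1} r_pi
    d_pi = np.linalg.solve(A.T, delta)   # d^pi = (I - gamma P_pi^T)^{-1} delta
    return v_pi, d_pi

# Sampling an interest function as in the figure (the state count is a placeholder).
rng = np.random.default_rng(0)
delta = rng.choice([-1.0, 1.0], size=104)
```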

Figure 8: 16-dimensional representations learned by training a deep network to predict the value function of a single policy, namely: the uniformly random policy, the optimal policy, and a convex combination of the two in equal proportions.
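A minimal sketch of this kind of setup is given below in PyTorch; the architecture, optimizer, training length, and placeholder targets are illustrative assumptions rather than the configuration used for the figure. The 16-dimensional representation is read off the penultimate layer after fitting the value of the chosen policy.

```python
import numpy as np
import torch
import torch.nn as nn

n_states, d = 104, 16                      # state count is a placeholder
v_pi = np.random.randn(n_states)           # placeholder; in practice V^pi of the chosen policy

phi = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(),
                    nn.Linear(64, d), nn.ReLU())   # representation phi(x)
head = nn.Linear(d, 1)                             # linear value prediction on top of phi

opt = torch.optim.Adam(list(phi.parameters()) + list(head.parameters()), lr=1e-3)
X = torch.eye(n_states)                            # one-hot state encoding
y = torch.tensor(v_pi, dtype=torch.float32).unsqueeze(1)

for _ in range(5000):
    opt.zero_grad()
    loss = ((head(phi(X)) - y) ** 2).mean()        # squared value-prediction error
    loss.backward()
    opt.step()

features = phi(X).detach().numpy()                 # 16-d features to visualize per state
```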

Figure 9: 16-dimensional representation generated by the proto-value function method (Mahadevan and Maggioni, 2007) applied to the left-singular vectors of the transition function corresponding to the uniformly random policy. The top-left feature, labelled '1', corresponds to the second largest singular value. Notice the asymmetries arising from the absorbing goal state and the walls.
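As a rough sketch of the SVD-based variant described in this caption (not the original Laplacian-eigenvector construction of Mahadevan and Maggioni, 2007), the features can be obtained directly from the singular value decomposition of the uniform-random-policy transition matrix, assumed given as `P_uniform`:

```python
import numpy as np

def pvf_features(P_uniform, k=16):
    """Features from the left-singular vectors of the uniform random policy's
    transition matrix. The leading vector is skipped, so feature '1' pairs
    with the second largest singular value, as noted in the caption."""
    U, s, Vt = np.linalg.svd(P_uniform)
    return U[:, 1:k + 1]        # (n_states, k) feature matrix
```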

[Figure 10 panel titles: Random deterministic policies; Random AVFs; Random stochastic policies; Proto-value functions]

Figure 10: 16-dimensional representations generated from the principal components of different sets of value functions. Beginning in the top-left corner, in clockwise order: from k = 1000 AVFs, sampled by drawing interest functions from {−1, 1}^n as in Figure 7; proto-value functions (Figure 9); from k = 1000 random deterministic policies (RDPs); and finally from k = 1000 random stochastic policies. Of the four, only PVFs and AVFs capture the long-range structure of the four-room domain.
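The value-function-based panels all follow the same recipe: stack the k value functions into a matrix and keep the top principal components as per-state features. A minimal sketch, assuming the value functions are already computed (the centring step is an assumption, not necessarily part of the original procedure):

```python
import numpy as np

def representation_from_values(value_functions, d=16):
    """Top-d principal components of a (k, n_states) array of value functions,
    returned as an (n_states, d) feature matrix."""
    V = np.asarray(value_functions, dtype=float)
    V = V - V.mean(axis=0, keepdims=True)          # subtract the mean value function (assumed)
    _, _, Wt = np.linalg.svd(V, full_matrices=False)
    return Wt[:d].T
```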

[Figure 11 panel titles: 20 AVFs; 100 AVFs; 400 AVFs; 1000 AVFs]

Figure 11: 16-dimensional representations generated from the principal components of sets of AVFs of varying sizes (k = 20, 100, 400, 1000). To minimize visualization variance, each set of AVFs contains the previous one. The accompanying video at https://www.youtube.com/watch?v=q_XG7GhImQQ shows the full progression from k = 16 to k = 1024.

[Figure 12 panel titles: Random policies; Corresponding value functions]

Figure 12: A sample of random deterministic policies, together with their corresponding value functions. These policies are generated by assigning a random action to each state. Under this sampling scheme, it is unlikely for a long chain of actions to reach the goal, leading to the corresponding value functions being zero almost everywhere.
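This sampling scheme is straightforward to reproduce; a small sketch follows, with state and action counts as placeholders. The resulting policy's value function can then be computed with a tabular solver such as the one sketched after Figure 7.

```python
import numpy as np

def random_deterministic_policy(n_states, n_actions, rng):
    """Assign an independent, uniformly random action to every state."""
    return rng.integers(n_actions, size=n_states)

rng = np.random.default_rng(0)
pi = random_deterministic_policy(n_states=104, n_actions=4, rng=rng)
```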

Figure 13: Average return (left) and average steps to goal (right) achieved by policies learned using a representation with the given number of features, produced by VALUE, AVF, or PVF. The average is over all states and 20 random seeds; shading gives the standard deviation.
