On the Complexity of Markov Decision
Problems
Yichen Chen
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Computer Science
Adviser: Professor Mengdi Wang
June 2020
© Copyright by Yichen Chen, 2020.
All rights reserved.
Abstract
The Markov decision problem (MDP) is one of the most basic models for sequential
decision-making problems in a dynamic environment where outcomes are partly ran-
dom. It models a stochastic control process in which a planner makes a sequence
of decisions as the system evolves. The Markov decision problem provides a mathe-
matical framework for dynamic programming, stochastic control, and reinforcement
learning. In this thesis, we study the complexity of solving MDPs.
In the first part of the thesis, we propose a class of stochastic primal-dual methods
for solving MDPs. We formulate the core policy optimization problem of the MDP
as a stochastic saddle point problem. By utilizing the value-policy duality structure,
the algorithm samples state-action transitions and makes alternating updates to the
value and policy estimates. We prove that our algorithm finds approximately
optimal policies for Markov decision problems with small space and computational
complexity. Using linear models to represent the value functions and the policies, our
algorithm is capable of scaling to problems with infinite and continuous state spaces.
In the second part of the thesis, we establish the computational complexity lower
bounds for solving MDPs. We prove our results by modeling the MDP algorithms
using branching programs and then characterizing the properties of these programs
by quantum arguments. The analysis is also extended to the study of the complexity
of solving two-player turn-based Markov games. Our results show that if we have a
simulator that can sample according to the transition probability function in O(1) time,
the lower bounds have reduced dependence on the state number. These results sug-
gest a fundamental difference between Markov decision problems with and without a
simulator.
We believe that our results provide a new piece of theoretical evidence for the
success of simulation-based methods in solving MDPs and Markov games.
Acknowledgements
I would like to give my deepest gratitude to my advisor, Professor Mengdi Wang, for
her guidance, patience, and inspiration in the past five years. It is she who introduced
me to the world of dynamic programming. From then on, she has been a continuing
source of wisdom and support. I have benefited tremendously from her taste in research,
meticulous attention to detail, and keen insights into solving research problems. I
am forever indebted to her for investing so much time in training me as a researcher,
reading over my papers, and improving my presentation skills. Beyond research, she
encouraged me to break out of my comfort zone, gave me plenty of opportunities to
attend conferences, and introduced me to world-class scientists in the field. Working
with her has been one of the most amazing experiences of my life.
I would like to thank Professor Sanjeev Arora and Professor Robert Schapire
– two leading figures in the research of theoretical computer science and machine
learning – for being on my thesis committee. Their commitment to research is truly
inspiring and motivates me to keep exploring the unknown. It has been a great honor
for me to interact with them over the years at Princeton. I am also grateful to
Professor Yuxin Chen and Professor Karthik Narasimhan for the valuable suggestions
and comments they have made regarding my research.
My special thanks go to my collaborator Lihong Li, who was also my mentor during
my internship at Google. I appreciated the discussions with him about the interesting
problems we worked on together. I am grateful for his invaluable advice regarding research,
presentation, writing, and more. He is always ready to help whenever I encounter
any difficulties in research or life. I also want to thank my host Jingtao Wang,
who made my internship at Google such an enjoyable journey.
I had a great time working at Princeton. I would like to thank the (past and
present) members of my research group: Saeed Ghadimi, Xudong Li, Lin Yang, Jian
Ge, Galen Cho, Hao Lu, Yifan Sun, Yaqi Duan, Hao Gong, Zheng Yu. I would like to
thank my other friends in Princeton: Weichen Wang, Junwei Lu, Levon Avanesyan,
Zongxi Li, Suqi Liu, Cong Ma, Kaizheng Wang, Xiaozhou Li, Nanxi Kang, Xin Jin,
Xinyi Fan, Linpeng Tang, Haoyu Zhang, Kelvin Zou, Yixin Sun, Yinda Zhang, Jun
Su, Zhengyu Song and Yixin Tao. Thank you all for making this journey enjoyable.
Finally, I would like to thank my parents Tianlei Chen and Zhixia Tao for their
unconditional love and support throughout my life. I would also like to thank my
beloved partner, Yiqin Shen, for her love, inspiration, and company. I dedicate this
thesis to them.
To my family.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction 1
2 Stochastic Primal-Dual Methods for Markov Decision Problems 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Preliminaries and Formulation of the Problem . . . . . . . . . . . . . 10
2.2.1 Finite-State Discounted-Reward MDP . . . . . . . . . . . . . 10
2.2.2 Finite-State Finite-Horizon MDP . . . . . . . . . . . . . . . . 11
2.3 Value-Policy Duality of Markov Decision Processes . . . . . . . . . . 14
2.3.1 Finite-State Discounted-Reward MDP . . . . . . . . . . . . . 14
2.3.2 Finite-State Finite-Horizon MDP . . . . . . . . . . . . . . . . 18
2.4 Stochastic Primal-Dual Methods for Markov Decision Problems . . . 22
2.5 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.1 Sample Complexity Analysis of Discounted-Reward MDPs . . 27
2.5.2 Sample Complexity Analysis of Finite-Horizon MDPs . . . . . 30
2.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3 Primal-Dual π Learning Using State and Action Features 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Preliminaries and Formulation of the Problem . . . . . . . . . . . . . 40
3.2.1 Infinite-Horizon Average-Reward MDP . . . . . . . . . . . . . 41
3.2.2 Infinite-Horizon Discounted-Reward MDP . . . . . . . . . . . 44
3.3 Model Reduction of MDP using State and Action Features . . . . . . 45
3.3.1 Using State and Action Features As Bases . . . . . . . . . . . 46
3.3.2 Reduced-Order Bellman Saddle Point Problem for Average-
Reward MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.3 Reduced-Order Bellman Saddle Point Problem for Discounted-
Reward MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Primal-Dual π Learning for Average-Reward MDPs . . . . . . . . . . 50
3.4.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.2 Sample Complexity Analysis . . . . . . . . . . . . . . . . . . . 52
3.5 Primal-Dual π Learning for Discounted-Reward MDPs . . . . . . . . 63
3.5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.5.2 Sample Complexity Analysis . . . . . . . . . . . . . . . . . . . 64
3.6 Related Literatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4 Complexity Lower Bounds of Discounted-Reward MDPs 76
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Main Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 Family of Hard Instances of MDP . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Hard Instances of Standard MDP . . . . . . . . . . . . . . . . . 84
4.4.2 Hard Instances of CDP MDP and Binary Tree MDP . . . . . . . . 85
4.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.1 A Sub-Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.2 Proof of Theorem 4.3.1 . . . . . . . . . . . . . . . . . . . . . 88
4.5.3 Proofs of Theorem 4.3.2 and 4.3.3 . . . . . . . . . . . . . . . . 92
4.5.4 Proof of Lemma 4.5.1 . . . . . . . . . . . . . . . . . . . . . . . 93
5 Complexity Lower Bounds of Markov Games 97
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 Main Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4 Hard Instances of Markov Games . . . . . . . . . . . . . . . . . . . . 107
5.4.1 Hard Instances of Array Markov Game . . . . . . . . . . . . . 107
5.4.2 Hard Instances of CDP Markov Game and Binary Tree Markov
Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.5.1 Computational Model . . . . . . . . . . . . . . . . . . . . . . 113
5.5.2 Proof of Theorem 5.3.1 . . . . . . . . . . . . . . . . . . . . . . 114
5.5.3 Proof of Theorem 5.3.2 . . . . . . . . . . . . . . . . . . . . . . 118
5.6 Extension to Markov Games with Irreducibility Property . . . . . . . 120
A Proofs in Chapter 2 122
A.1 Analysis of the SPD-dMDP Algorithm 2.1 . . . . . . . . . . . . . . . 122
A.1.1 Preliminary Lemmas . . . . . . . . . . . . . . . . . . . . . . . 123
A.1.2 Proof of Theorem 2.5.1 . . . . . . . . . . . . . . . . . . . . . . 130
A.1.3 Proof of Theorem 2.5.2 . . . . . . . . . . . . . . . . . . . . . . 131
A.1.4 Proof of Theorem 2.5.3 . . . . . . . . . . . . . . . . . . . . . . 133
A.2 Analysis of the SPD-fMDP Algorithm 2.2 . . . . . . . . . . . . . . . 134
A.2.1 Preliminary Lemmas . . . . . . . . . . . . . . . . . . . . . . . 134
A.2.2 Proof of Theorem 2.5.4 . . . . . . . . . . . . . . . . . . . . . . 145
A.2.3 Proof of Theorem 2.5.5 . . . . . . . . . . . . . . . . . . . . . . 146
A.2.4 Proof of Theorem 2.5.6 . . . . . . . . . . . . . . . . . . . . . . 148
B Proofs in Chapter 3 150
B.1 Proof of Lemmas 3.4.1 and 3.4.2 . . . . . . . . . . . . . . . . . . . . . 150
B.2 Proof of Lemmas 3.5.1 and 3.5.2 . . . . . . . . . . . . . . . . . . . . . 155
C Proofs in Chapter 5 159
C.1 Supporting Lemmas for Theorem 5.3.1 . . . . . . . . . . . . . . . . . 159
C.1.1 Proof of Lemma 5.5.1 . . . . . . . . . . . . . . . . . . . . . . . 159
C.1.2 Proof of Lemma 5.5.2 . . . . . . . . . . . . . . . . . . . . . . . 163
C.2 Supporting Lemmas for Theorem 5.3.2 . . . . . . . . . . . . . . . . . 165
C.2.1 Proof of Lemma 5.5.3 . . . . . . . . . . . . . . . . . . . . . . . 165
C.2.2 Proof of Lemma 5.5.4 . . . . . . . . . . . . . . . . . . . . . . 167
Bibliography 168
List of Figures
4.1 Input of Standard MDP: Arrays of Transition Probabilities for M1 ∈ M1
and M2 ∈ M2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2 Input of CDP MDP: Arrays of Cumulative Probabilities for M3 ∈ M3
and M4 ∈ M4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3 Input of Binary Tree MDP: (a) A snippet of the binary tree of M3 ∈
M3, where all transitions are to bad states. (b) A snippet of the
binary tree of M4 ∈ M4, where there is some (s, a) which transitions
to a good state sG,2 with probability ε. . . . . . . . . . . . . . . . . . 87
5.1 A hard instance of MDP: In case of Type I states, the state transitions
to a random bad state s′ ∈ SB with probability 1 for every action a ∈ AN .
In case of Type II states, the state transitions to a rewarding state
s̄ ∈ SG with probability 1 under some action ā ∈ AN . . . . . . . . . . 107
5.2 A hard instance of CDP Markov Game and Binary Tree Markov Game: In
the case of Type III states, the state transitions to the first 1/ε states each
with probability ε for every action a ∈ AN . In the case of Type IV states, the
state transitions to a rewarding state s̄ ∈ SG with probability ε under
some action ā ∈ AN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Chapter 1
Introduction
Reinforcement learning lies at the intersection between control, machine learning,
and stochastic processes. Recent empirical successes demonstrate that reinforcement
learning, combined with sophisticated function approximation techniques (e.g., deep
neural networks), can conquer even the most challenging tasks in video and board
games [81, 97]. These successes have drawn attention to the research of the Markov
decision process, which provides a mathematical framework for dynamic program-
ming, stochastic control, and reinforcement learning. The Markov decision process
is a controlled random walk over a state space S, where in each state s ∈ S, one
can choose an action a from an action space A so that the random walk transitions
to another state s′ ∈ S with probability P (s, a, s′) and yields an immediate reward r(s, a).
The key goal is to identify an optimal policy that, when run on the Markov
decision process, maximizes some cumulative function of rewards. We use the expres-
sion Markov decision problem for a Markov decision process together with such an
optimality criterion [90].
Researchers have been developing methods for solving Markov decision problems
for decades. There are three major approaches for solving the MDP: the value iter-
ation method, the policy iteration method, and the linear programming method. In
1957, Bellman [8] developed an iteration algorithm, called value iteration, to compute
the optimal total reward function, which is guaranteed to converge in the limit. In
policy iteration [58], the algorithm alternates between a value determination phase,
in which the current policy is evaluated, and a policy improvement phase, in which
the policy is updated according to the evaluation. Around the same time, D’Epenoux
[39] and de Ghellinck [38] discovered that the MDP has an LP formulation, allowing
it to be solved by general LP methods such as the simplex method [35], the Ellipsoid
method [60] or the interior-point algorithm [59].
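As a concrete aside, the value iteration method mentioned above is short enough to sketch in code. The Python snippet below is an illustrative implementation on a made-up two-state, two-action MDP; the array layout (P[a, i, j] for P_a(i, j) and r[a, i] for the expected one-step reward of action a in state i) is a choice of this sketch, not a convention used elsewhere in the thesis.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Bellman's value iteration for a finite MDP.

    P: transition tensor of shape (|A|, |S|, |S|), P[a, i, j] = P_a(i, j)
    r: expected reward array of shape (|A|, |S|)
    Returns the optimal value vector and a greedy optimal policy.
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # Q[a, i] = r_a(i) + gamma * sum_j P_a(i, j) v(j)
        Q = r + gamma * (P @ v)
        v_new = Q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, Q.argmax(axis=0)
        v = v_new

# A toy 2-state, 2-action MDP; the numbers are invented for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v_star, pi_star = value_iteration(P, r, gamma=0.9)
```

Because the Bellman operator is a γ-contraction, the loop converges geometrically; this is the "guaranteed to converge in the limit" property noted above.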
As the notion of computational complexity emerged, there were tremendous efforts
in analyzing the complexity of MDPs and methods for solving them. All three of
the methods mentioned above have been proved to solve MDPs in polynomial
time [104, 71, 116, 68]. Here, polynomial time means that the number of arithmetic
operations needed to compute an exact optimal policy is bounded by a polynomial in
the number of states, the number of actions, and the bit-size of the data. However,
although they run in polynomial time, all of these methods have a computational
complexity that is superlinear in the input size Θ(|S|²|A|). This is mainly due to the
nature of the goal of finding the exact optimal policy: it requires full knowledge of the
system, which suffers from the curse of dimensionality.
In this thesis, we study the problem of finding the approximately optimal policies
for MDPs: is there a way to trade the precision of the exact optimal policy for a
better time complexity? In particular, we are interested in reducing the complexity’s
dependence on the size of the state space and the size of the action space. To serve this
purpose, we develop a class of stochastic primal-dual methods for solving MDPs.
We formulate the core policy optimization problem of the MDP as a stochastic saddle
point problem. By utilizing the value-policy duality structure of MDPs and leveraging
the powerful machinery from optimization, our primal-dual algorithms can find the
approximately optimal policies with small space and computational complexity. By
adopting linear models to represent the high-dimensional value function and state-
action density functions, our algorithm is capable of scaling to problems with infinite
and continuous state spaces.
Meanwhile, we are interested in the computational complexity lower bounds for
solving MDPs. Here, the computational complexity lower bounds mean the minimum
number of arithmetic operations or queries to the input data, as a function of |S| and
|A|, that is required for any algorithm to solve the MDP with high probability. In
contrast to the large volumes of upper bound results, there are fewer results on the
complexity lower bounds [87, 52]. In this thesis, we establish the first computational
complexity lower bounds for solving MDPs. The analysis is also extended to the
study of the computational complexity for two-player turn-based Markov games. Our
results show that if we have a simulator that can sample according to the transition
probability function in O(1) time, the lower bounds have reduced dependence on the state
number. These results suggest a fundamental difference between problems with and
without a simulator, which explains the success of the simulation-based approaches
for Markov decision problems in recent years [97].
Here we give an overview of each chapter in the thesis. Chapter 2 and Chapter 3
study the complexity of MDPs from the positive side, developing new algorithms with
provable convergence rate guarantees. Chapter 4 and Chapter 5 study the complexity
from the negative side, establishing the first computational complexity lower bounds
for solving MDPs and Markov games.
Stochastic primal-dual methods for Markov decision problems (Chapter 2
[109, 23])
We study the online estimation of the optimal policy of a Markov decision process
(MDP). Central to the problem is to learn the optimal policy and/or the value function
of the system. In the empirically successful actor-critic algorithm [62], the searching
procedure updates the value function and the policy simultaneously. In essence, it uses
the value estimates to update the policy more accurately and uses the policy estimate
to update the value function with less variance. In Chapter 2, we justify this procedure
using the optimization theory. We show the dual relationship between the value
function and the policy function in a saddle point formulation of the Markov decision
process. The interpretation demonstrates the primal-dual nature of the alternating
update methods. Based on this insight, we propose a class of Stochastic Primal-Dual
(SPD) methods that exploit the inherent minimax duality of Bellman equations. The
SPD methods update a few coordinates of the value and policy estimates as a new
state transition is observed. We prove the convergence rate of these algorithms and
show that they are both space-efficient and computationally efficient.
Primal-dual π learning using state and action features (Chapter 3 [22])
Approximate linear programming (ALP) represents one of the major algorithmic fam-
ilies to solve large-scale Markov decision processes (MDPs). In this chapter, we study
a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm
called primal-dual π learning for reinforcement learning when a sampling oracle is
provided. This algorithm enjoys several advantages. First, it adopts linear models to
represent the high-dimensional value function and state-action density functions, re-
spectively, using given state and action features. Its run-time complexity depends on
the number of features, not the size of the underlying MDPs. Second, our algorithm
and analysis apply to an arbitrary state space, even if the state space is continuous
and infinite. Third, it operates in a fully online fashion without having to store any
sample, thus having minimal memory footprint. Fourth, we prove that it is sample-
efficient, solving for the optimal policy to high precision with a sample complexity
linear in the dimension of the parameter space.
Complexity lower bounds of discounted-reward MDPs (Chapter 4)
We study the computational complexity of the infinite-horizon discounted-reward
Markov Decision Problem (MDP) with a finite state space S and a finite action space
A. We show that any randomized algorithm needs a running time of at least Ω(|S|²|A|)
to compute an ε-optimal policy with high probability. We consider two variants of
the MDP where the input is given in specific data structures, including arrays of
cumulative probabilities and binary trees of transition probabilities. For these cases,
we show that the complexity lower bound reduces to Ω(|S||A|/ε). These results reveal
a surprising observation: the computational complexity of the MDP depends on
the data structure of the input.
Complexity lower bounds of Markov games (Chapter 5)
Recent years have seen the huge success of reinforcement learning in strategic games
for artificial intelligence [81, 97]. The reinforcement learning agent learns to play
optimal actions from games of self-play. In game theory, such systems are modeled
by stochastic games, which have been studied for decades [95]. Surprisingly, it is
only in recent years that 2-player turn-based stochastic games have been proved to
be solvable in strongly polynomial time [53]. In Chapter 5, we study the compu-
tational complexity lower bounds for solving 2-player turn-based stochastic games.
Our results show that the worst-case computational complexity lower bounds depend
on the data structure used in representing the stochastic game. When the transition
probabilities of the game are given in the form of arrays, the lower bounds have a
linear dependence on the number of states and actions. When the probabilities are
given in the format that allows efficient simulation of the game, the lower bounds
have a sublinear dependence on the number of states and actions. Our results sug-
gest that the efficient simulator used in reinforcement learning methods might be a
contributing factor to its empirical success.
Chapter 2
Stochastic Primal-Dual Methods
for Markov Decision Problems
2.1 Introduction
The Markov decision process (MDP) is one of the most basic models of dynamic pro-
gramming, stochastic control and reinforcement learning; see the textbook references
[10, 13, 100, 90]. Given a controllable Markov chain and the distribution of state-
to-state transition rewards, the aim is to find the optimal action to perform at each
state in order to maximize the expected overall reward. MDP and its numerous vari-
ants are widely applied in engineering systems, artificial intelligence, e-commerce and
finance. Classical solvers of MDP require full knowledge of the underlying stochastic
process and reward distribution, which are often not available in practice.
In this chapter, we study both the infinite-horizon discounted-reward MDP and
the finite-horizon MDP. In both cases, we assume that the MDP has a finite state
space S and a finite action space A. We focus on the model-free learning setting
where both transition probabilities and transitional rewards are unknown. Instead,
a simulation oracle is available to generate random state-to-state transitions and
transitional rewards. The simulation oracle is able to model offline retrieval of static
empirical data as well as live interaction with real-time simulation systems. The
algorithmic goal is to estimate the optimal policy of the unknown MDP based on
empirical state transitions, without any prior knowledge or restrictive assumption
about the underlying process. In the literature of approximate dynamic programming
and reinforcement learning, many methods have been developed, and some of them
are proved to achieve near-optimal performance guarantees in certain senses; recent
examples include [85, 34, 67, 66, 109]. Although researchers have made significant
progress in developing reinforcement learning methods, it remains unclear whether
there is an approach that achieves both theoretical optimality and practical scalability.
This is an active area of research.
In this chapter, we present a novel approach motivated by the linear programming
formulation of the nonlinear Bellman equation. We formulate the Bellman equation
into a stochastic saddle point problem, where the optimal primal and dual solutions
correspond to the optimal value and policy functions, respectively. We propose a class
of Stochastic Primal-Dual algorithms (SPD) for the discounted MDP and the finite-
horizon MDP. Each iteration of the algorithms updates the primal and dual solutions
simultaneously using noisy partial derivatives of the Lagrangian function. We show
that one can compute a noisy partial derivative efficiently from a single observation of
the state transition. The SPD methods are stochastic analogs of the primal-dual iter-
ation for linear programming. They also involve alternating projections onto specially
constructed sets. The SPD methods are straightforward to implement and exhibit
favorable space complexity. To analyze their sample complexity, we adopt the notion
of “Probably Approximately Correct” (PAC), which means achieving an ε-optimal
policy with high probability using a sample size polynomial in the problem parameters.
The main contributions of this chapter are fourfold:
1. We study the basic linear algebra of reinforcement learning. We show that
the optimal value and optimal policy are dual to each other, and they are
the solutions to a stochastic saddle point problem. The value-policy duality
implies a convenient algebraic structure that may facilitate efficient learning
and dimension reduction.
2. We develop a class of stochastic primal-dual (SPD) methods that maintain a
value estimate and a policy estimate and update their coordinates while pro-
cessing state-transition data incrementally. The SPD methods exhibit superior
space and computational scalability. They require O(|S| × |A|) space for dis-
counted MDP and O(|S| × |A| × H) space for finite-horizon MDP. The space
complexity of SPD is sublinear in the input size of the MDP model. For dis-
counted MDP, each iteration updates two coordinates of the value estimate and
a single coordinate of the policy estimate. For finite-horizon MDP, each iter-
ation updates 2H coordinates of the value estimate and H coordinates of the
policy estimate.
3. For discounted MDP, we develop the SPD-dMDP Algorithm 2.1. It yields
an ε-optimal policy with probability at least 1 − δ using the following sample
size/iteration number:

O( (|S|⁴|A|²σ² / ((1 − γ)⁶ε²)) · ln(1/δ) ),

where γ ∈ (0, 1) is the discount factor, |S| and |A| are the sizes of the state space
and the action space, and σ is a uniform upper bound on the state-transition rewards.
We obtain the sample complexity results by analyzing the duality gap sequence
and applying the Bernstein inequality to a specially constructed martingale.
The analysis is novel to the best of the authors' knowledge.
4. For finite-horizon MDP, we develop the SPD-fMDP Algorithm 2.2. It yields
an ε-optimal policy with probability at least 1 − δ using the following sample
size/iteration number:

O( (|S|⁴|A|²H⁶σ² / ε²) · ln(1/δ) ),
where H is the total number of periods. The key aspect of the finite-horizon
algorithm is to adapt the learning rate/stepsize for updates on different periods.
In particular, the algorithm has to update the policies associated with the earlier
periods more aggressively than those associated with the later periods.
The SPD is a model-free method and applies to a wide class of dynamic programming
problems. Within the scope of this chapter, the sample transitions are drawn from a
static distribution. We conjecture that the sample complexity results can be improved
by allowing exploitation, i.e., adaptive sampling of actions. The results of this chapter
suggest that the linear duality of MDP bears convenient structures yet to be fully
exploited.
Chapter Organization Section 2.2 reviews the basics of discounted and finite-
horizon MDP and related works in this area. Section 2.3 studies the duality between
optimal values and policies. Section 2.4 presents the SPD-dMDP and SPD-fMDP
algorithms and discusses their implementation and complexities. Section 2.5 presents
the main results; the proofs are deferred to the appendix.
Notations All vectors are considered as column vectors. For a vector x ∈ R^n, we
denote by x^T its transpose, and denote by ‖x‖ = √(x^T x) its Euclidean norm. For a
matrix A ∈ R^{n×n}, we denote by ‖A‖ = max{‖Ax‖ : ‖x‖ = 1} its induced Euclidean
norm. For a set X ⊂ R^n and a vector y ∈ R^n, we denote by
Π_X{y} = argmin_{x∈X} ‖y − x‖² the Euclidean projection of y onto X, where the
minimum is always uniquely attained if X is nonempty, convex, and closed. We denote
by e = (1, . . . , 1)^T the vector with all entries equal to 1, and by
e_i = (0, . . . , 0, 1, 0, . . . , 0)^T the vector whose i-th entry equals 1 and all other
entries equal 0. For a set X, we denote its cardinality by |X|.
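Since the Euclidean projection Π_X recurs throughout the algorithms of this chapter, a brief illustration may help. The Python sketch below computes the projection onto two simple convex sets, a box and the probability simplex; the simplex routine follows the standard sort-based algorithm, and the function names and test vectors are ours for illustration.

```python
import numpy as np

def project_box(y, lo, hi):
    """Euclidean projection of y onto the box {x : lo <= x <= hi}."""
    return np.clip(y, lo, hi)

def project_simplex(y):
    """Euclidean projection of y onto the probability simplex
    {x : x >= 0, sum(x) = 1}, via the standard sort-based algorithm."""
    u = np.sort(y)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]   # largest feasible index
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(y + theta, 0.0)

x = project_simplex(np.array([0.5, 1.2, -0.3]))   # a point in the simplex
```

Both projections run in O(n log n) or better, which is what makes projection-based primal-dual updates cheap per iteration.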
2.2 Preliminaries and Formulation of the Problem
In this section, we review the basic models of Markov decision processes.
2.2.1 Finite-State Discounted-Reward MDP
We consider a discounted MDP described by a tuple M = (S, A, P, r, γ), where S
is a finite state space, A is a finite action space, and γ ∈ (0, 1) is a discount factor. If
action a is selected while the system is in state i, the system transitions to state j
with probability P_a(i, j) and incurs a random reward r̂_{ija} ∈ [0, σ] with expectation
r_{ija}.
Let π : S ↦ A be a policy that maps a state i ∈ S to an action π(i) ∈ A. Consider
the Markov chain under policy π. We denote its transition probability matrix by P_π
and its expected transitional reward vector by r_π, i.e.,

P_π(i, j) = P_{π(i)}(i, j),    r_π(i) = Σ_{j∈S} P_{π(i)}(i, j) r_{ij π(i)},    i, j ∈ S.
The objective is to find an optimal policy π* : S ↦ A such that the infinite-horizon
discounted reward is maximized, regardless of the initial state:

max_{π : S ↦ A} E_π [ Σ_{k=0}^∞ γ^k r̂_{i_k i_{k+1} π(i_k)} ],

where γ ∈ (0, 1) is the discount factor, (i_0, i_1, . . .) are the state transitions
generated by the Markov chain under policy π, and the expectation is taken over the
entire process. We assume throughout that there exists a unique optimal policy π* to
the MDP tuple M = (S, A, P, r, γ). In other words, there exists one optimal action
for each state.
We review the standard definitions of value functions.

Definition 2.2.1. The value vector v^π ∈ R^|S| of a fixed policy π is defined as

v^π(i) = E_π [ Σ_{k=0}^∞ γ^k r̂_{i_k i_{k+1} π(i_k)} | i_0 = i ],    i ∈ S.

Definition 2.2.2. The optimal value vector v* ∈ R^|S| is defined as

v*(i) = max_{π : S ↦ A} E_π [ Σ_{k=0}^∞ γ^k r̂_{i_k i_{k+1} π(i_k)} | i_0 = i ],    i ∈ S.

For the sample complexity analysis of the proposed algorithm, we need a notion
of sub-optimality of policies. We give its definition below.
Definition 2.2.3. We say that a policy π is absolute-ε-optimal if

max_{i∈S} |v^π(i) − v*(i)| ≤ ε.

If a policy is absolute-ε-optimal, it achieves an ε-optimal reward regardless of the
initial state distribution. We note that absolute-ε-optimality is one of the strongest
notions of sub-optimality for policies. In comparison, some literature analyzes the
expected sub-optimality of a policy when the initial state i follows a certain distribution.
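As an illustration of Definitions 2.2.1 and 2.2.3: for a fixed policy π on a finite state space, the value vector satisfies the linear Bellman equation v^π = r_π + γ P_π v^π, so it can be computed exactly as v^π = (I − γ P_π)^{-1} r_π. The Python sketch below does this for an invented two-state chain; the function names and numbers are illustrative only.

```python
import numpy as np

def policy_value(P_pi, r_pi, gamma):
    """Solve the Bellman equation v = r_pi + gamma * P_pi @ v for a fixed policy.

    P_pi: row-stochastic |S| x |S| matrix induced by the policy
    r_pi: expected one-step reward vector under the policy
    """
    n = P_pi.shape[0]
    # I - gamma * P_pi is invertible since gamma < 1 and P_pi is stochastic.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def is_absolute_eps_optimal(v_pi, v_star, eps):
    """Definition 2.2.3: max_i |v_pi(i) - v*(i)| <= eps."""
    return np.max(np.abs(v_pi - v_star)) <= eps

# Illustrative two-state chain induced by some fixed policy.
P_pi = np.array([[0.9, 0.1],
                 [0.3, 0.7]])
r_pi = np.array([1.0, 0.5])
v_pi = policy_value(P_pi, r_pi, gamma=0.9)
```

The direct solve costs O(|S|³), which is exactly the kind of per-step cost the stochastic methods of this chapter are designed to avoid.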
2.2.2 Finite-State Finite-Horizon MDP
We also consider the finite-horizon Markov decision process, which can be formulated
as a tuple M = (S, A, H, P, r), where S is a finite state space with transition
probabilities encoded by P = (P_a)_{a∈A} ∈ R^{|S|×|S|×|A|}, A is a finite action space,
and H is the horizon. If action a is selected while the system is in state i ∈ S at period
h = 0, . . . , H − 1, the system transitions to state j at period h + 1 with probability
P_a(i, j) and incurs a random reward r̂_{ija} with expectation r_{ija}.
of presentation, we assume that the reward we receive does not depend on the time
period. Our algorithm can be readily extended to the case when the reward varies
with the time period. We assume that both P and r are unknown but they can be
estimated by sampling.
We augment the state space with the time period to obtain an augmented Markov
chain. Now we have a replica of S at each period h, denoted by Sh. Let S[H]
be the state space of the augmented MDP where [H] denotes the set of integers
{0, 1, . . . , H − 1}. If we select action a in state (i, h) ∈ S[H], the state transitions
to a new state (j, h + 1) ∈ S[H] with probability Pa(i, j). The transition incurs a
random reward r̂ija ∈ [0, σ] with expectation rija. At period H − 1, the state i will
transition to the terminal state with reward r̂ija. In the rest of the chapter, we use
Πa(i′, j′) to denote the transition probability of the augmented Markov chain where
i′, j′ ∈ {(i, h)|i ∈ S, h ∈ [H]}.
Let π = (π0, . . . , πH−1) be a sequence of one-step policies such that πh maps a
state i ∈ S to an action πh(i) ∈ A in the h-th period. Consider the augmented
Markov chain under policy $\pi$. We denote its transition probability matrix by $\Pi_\pi$,
where $\Pi_\pi\big((i,h),(j,h+1)\big) = P_{\pi_h(i)}(i,j)$ for all $h \in [H-1]$, $i, j \in S$, which is given by
$$\Pi_\pi = \begin{pmatrix} 0 & P_{\pi_0} & 0 & \cdots & 0 \\ 0 & 0 & P_{\pi_1} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & P_{\pi_{H-2}} \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}.$$
We denote by $r_a \in \mathbb{R}^{|S|}$ the expected state transition reward under action $a$, such that
$$r_a(i) = \sum_{j \in S} P_a(i,j)\, r_{ija}, \quad i \in S.$$
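The block structure of $\Pi_\pi$ can be checked numerically. Below is a minimal NumPy sketch that assembles the augmented transition matrix from per-action transition matrices and a per-period policy; the function and variable names are illustrative, not from the text.

```python
import numpy as np

def augmented_transition(P, policy):
    """Build the (|S|H x |S|H) matrix Pi_pi of the augmented chain.

    P[a] is the |S| x |S| transition matrix of action a, and policy[h]
    maps each state to an action in period h. The last block row is
    zero: period H-1 transitions into the absorbing terminal state.
    """
    H, n = len(policy), P[0].shape[0]
    Pi = np.zeros((n * H, n * H))
    for h in range(H - 1):
        # Row block h, column block h+1 holds P_{pi_h}: row i of the
        # block is row i of the matrix for the action policy[h][i].
        P_h = np.array([P[policy[h][i]][i] for i in range(n)])
        Pi[h * n:(h + 1) * n, (h + 1) * n:(h + 2) * n] = P_h
    return Pi
```

Only the superdiagonal block band is nonzero, which is what makes $(I - \Pi_{\pi}^T)$ triangular with unit determinant in Theorem 2.3.2.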
We review the standard definitions of value functions for finite-horizon MDP.
Definition 2.2.4. The $h$-period value function $v^\pi_h : S_h \mapsto \mathbb{R}$ under policy $\pi$ is defined as
$$v^\pi_h(i) = \mathbb{E}^\pi\left[\sum_{\tau=h}^{H-1} \hat{r}_{i_\tau i_{\tau+1} \pi_\tau(i_\tau)} \,\Big|\, i_h = i\right], \quad \forall\, h \in [H],\ i \in S.$$

The random variables $(i_h, i_{h+1}, \ldots)$ are state transitions generated by the Markov
chain under policy $\pi$ starting from state $i$ and period $h$, and the expectation $\mathbb{E}^\pi[\cdot]$
is taken over the entire process. We denote the overall value function by $v = (v_0^T, \ldots, v_{H-1}^T)^T \in \mathbb{R}^{|S|H}$.
The objective of the finite-horizon MDP is to find an optimal policy π∗ : S[H] 7→ A
such that the finite-horizon reward is maximized, regardless of the starting state.
Based on the optimal policy, we define the optimal value function as follows.
Definition 2.2.5. The optimal value vector $v^* = (v^*_h)_{h \in [H]} \in \mathbb{R}^{H \times |S|}$ is defined as
$$v^*_h(i) = \max_{\pi: S_{[H]} \mapsto A} \mathbb{E}^\pi\left[\sum_{\tau=h}^{H-1} \hat{r}_{i_\tau i_{\tau+1} \pi_\tau(i_\tau)} \,\Big|\, i_h = i\right] = \mathbb{E}^{\pi^*}\left[\sum_{\tau=h}^{H-1} \hat{r}_{i_\tau i_{\tau+1} \pi^*_\tau(i_\tau)} \,\Big|\, i_h = i\right],$$
for all $h \in [H]$, $i \in S$.
In order to analyze the sample complexity, we use the following notion of absolute-$\epsilon$-optimality for finite-horizon MDP.
Definition 2.2.6. We say that a policy $\pi$ is absolute-$\epsilon$-optimal if
$$\max_{h \in [H],\, i \in S} |v^\pi_h(i) - v^*_h(i)| \le \epsilon.$$
Note that if a policy is absolute-$\epsilon$-optimal, it achieves an $\epsilon$-optimal cumulative
reward from all states and all intermediate periods. This is one of the strongest
notions of sub-optimality for finite-horizon policies.
2.3 Value-Policy Duality of Markov Decision Processes
In this section, we study the Bellman equation of the Markov decision process from
the perspective of linear duality. We show that the optimal value and policy functions
are dual to each other and they are the solutions to a special saddle point problem.
We analyze the value-policy duality for the infinite-horizon discounted-reward case
and the finite-horizon case separately.
2.3.1 Finite-State Discounted-Reward MDP
Consider the discounted MDP described by the tuple $M = (S, A, P, r, \gamma)$ as in Section 2.2.1. According to the theory of dynamic programming [10], a vector $v^*$ is the optimal value function of the MDP if and only if it satisfies the following system of $|S|$ equations in $|S|$ unknowns, known as the Bellman equation:
$$v^*(i) = \max_{a \in A}\left\{\gamma \sum_{j \in S} P_a(i,j)\, v^*(j) + \sum_{j \in S} P_a(i,j)\, r_{ija}\right\}, \quad i \in S. \tag{2.3.1}$$
When $\gamma \in (0,1)$, the Bellman equation has a unique fixed point solution $v^*$, and it
equals the optimal value function of the MDP. Moreover, a policy $\pi^*$ is an optimal
policy for the MDP if and only if it attains the maximization in the Bellman equation.
Note that this is a nonlinear system of fixed point equations. Interestingly, the Bellman
equation (2.3.1) is equivalent to the following linear programming (LP) problem with $|S|$ variables and $|S| \cdot |A|$ constraints (see Puterman, 2014 [90], Section 6.9, and the paper by de Farias and
Van Roy [36]):
$$\begin{array}{ll} \text{minimize} & \xi^T v \\ \text{subject to} & (I - \gamma P_a)\, v - r_a \ge 0, \quad a \in A, \end{array} \tag{2.3.2}$$
where $\xi$ is an arbitrary vector with positive entries, $P_a \in \mathbb{R}^{|S| \times |S|}$ is the matrix whose
$(i,j)$-th entry equals $P_a(i,j)$, $I$ is the $|S| \times |S|$ identity matrix, and $r_a \in \mathbb{R}^{|S|}$ is
the expected state transition reward under action $a$, given by $r_a(i) = \sum_{j \in S} P_a(i,j)\, r_{ija}$, $i \in S$. The dual linear program of (2.3.2) is
$$\begin{array}{ll} \text{maximize} & \displaystyle\sum_{a \in A} \lambda_a^T r_a \\ \text{subject to} & \displaystyle\sum_{a \in A} \left(I - \gamma P_a^T\right) \lambda_a = \xi, \quad \lambda_a \ge 0, \quad a \in A. \end{array} \tag{2.3.3}$$
We will show that the optimal solution λ∗ to the dual problem (2.3.3) corresponds
to the optimal policy π∗ of the MDP. The duality between the optimal value vector
and the optimal policy is established in Theorem 2.3.1. We remark that part of these
results was known in the classical MDP literature; see Puterman, 2014 [90], Section
6.9. We provide a short proof for completeness.
Theorem 2.3.1 (Value-Policy Duality for Discounted MDP). Assume that the
discounted-reward infinite-horizon MDP tuple $M = (S, A, P, r, \gamma)$ has a unique
optimal policy $\pi^*$. Then $(v^*, \lambda^*)$ is the unique pair of primal and dual solutions to
(2.3.2), (2.3.3) if and only if
$$v^* = (I - \gamma P_{\pi^*})^{-1} r_{\pi^*}, \qquad \left(\lambda^*_{\pi^*(i),i}\right)_{i \in S} = (I - \gamma P_{\pi^*}^T)^{-1} \xi, \qquad \lambda^*_{a,i} = 0 \ \text{if} \ a \neq \pi^*(i).$$
Proof. The proof is based on the fundamental property of linear programming, i.e.,
$(v^*, \lambda^*)$ is the optimal pair of primal and dual solutions if and only if:

(a) $v^*$ is primal feasible, i.e., $(I - \gamma P_a)\, v^* - r_a \ge 0$ for all $a \in A$.

(b) $\lambda^*$ is dual feasible, i.e., $\sum_{a \in A} \left(I - \gamma P_a^T\right) \lambda^*_a = \xi$ and $\lambda^*_a \ge 0$ for all $a \in A$.

(c) $(v^*, \lambda^*)$ satisfies the complementary slackness condition
$$\lambda^*_{a,i} \cdot \left(v^*_i - \gamma P_{a,i} v^* - r_{a,i}\right) = 0 \quad \forall\, i \in S,\ a \in A,$$
where $\lambda^*_{a,i}$ is the $i$-th element of $\lambda^*_a$ and $P_{a,i}$ is the $i$-th row of $P_a$.
Suppose that $(v^*, \lambda^*)$ is primal-dual optimal. As a result, it satisfies (a), (b), (c)
and $v^*$ is the optimal value vector. By the definition of the optimal value function, we
know that $v^*_i - \gamma P_{\pi^*(i),i} v^* - r_{\pi^*(i),i} = 0$. Since $\pi^*$ is unique, we have $v^*_i - \gamma P_{a,i} v^* - r_{a,i} > 0$ if $a \neq \pi^*(i)$. As a result, we have $\lambda^*_{a,i} = 0$ for all $a \neq \pi^*(i)$. This means that the
optimal dual variable $\lambda^*$ has exactly $|S|$ nonzeros, corresponding to the $|S|$ active row
constraints of the primal problem (2.3.2). We combine this observation with the dual
feasibility relation $\sum_{a \in A} \left(I - \gamma P_a^T\right) \lambda^*_a = \xi$ and obtain
$$(I - \gamma P_{\pi^*}^T) \left(\lambda^*_{\pi^*(i),i}\right)_{i \in S} = \xi.$$
Note that all eigenvalues of $P_{\pi^*}$ lie in the closed unit disk, therefore $(I - \gamma P_{\pi^*}^T)$ is
invertible. Then we have $\left(\lambda^*_{\pi^*(i),i}\right)_{i \in S} = (I - \gamma P_{\pi^*}^T)^{-1} \xi$, which together with the
complementary slackness condition implies that $\lambda^*$ is unique. Similarly, we can show that
$v^* = (I - \gamma P_{\pi^*})^{-1} r_{\pi^*}$ from primal feasibility and the slackness condition.
Now suppose that (v∗, λ∗) satisfies the three conditions stated in Theorem 2.3.1.
Then we obtain (a),(b),(c) directly, which proves that (v∗, λ∗) is primal-dual optimal.
□
Theorem 2.3.1 suggests a critical correspondence between the optimal dual solu-
tion λ∗ and the optimal policy π∗. In particular, one can recover the optimal policy
π∗ from the basis of λ∗ as follows:
π∗(i) = a, if λ∗a,i > 0.
In other words, finding the optimal policy is equivalent to finding the basis of the
optimal dual solution. This suggests that learning the optimal policy is a special case
of solving a stochastic linear program.
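To make the value-policy duality concrete, one can solve the dual LP (2.3.3) on a toy instance and read the optimal policy off the basis of $\lambda^*$. The following sketch uses SciPy's `linprog`; the two-state problem data are illustrative, not from the text.

```python
import numpy as np
from scipy.optimize import linprog

# A toy discounted MDP (hypothetical data): 2 states, 2 actions.
n, m, gamma = 2, 2, 0.9
P = np.zeros((m, n, n))
P[0] = np.eye(n)                        # action 0: stay put, reward 0
P[1] = np.array([[0., 1.], [1., 0.]])   # action 1: swap states, reward 1
r = np.array([[0., 0.],
              [1., 1.]])                # r[a, i] = expected reward r_a(i)
xi = np.ones(n)                         # any positive weight vector

# Dual LP (2.3.3): maximize sum_a lambda_a^T r_a subject to
# sum_a (I - gamma P_a^T) lambda_a = xi, lambda >= 0.
A_eq = np.hstack([np.eye(n) - gamma * P[a].T for a in range(m)])
c = -r.reshape(-1)                      # linprog minimizes, so negate
res = linprog(c, A_eq=A_eq, b_eq=xi, bounds=(0, None), method="highs")
lam = res.x.reshape(m, n)

# Value-policy duality: the optimal policy sits on the basis of lambda*,
# i.e. pi*(i) = argmax_a lambda_a(i) (one positive entry per state).
policy = lam.argmax(axis=0)
print(policy)   # -> [1 1]: always swap, collecting reward 1 per step
```

Summing the equality constraints also recovers $\|\lambda^*\|_1 = \|\xi\|_1/(1-\gamma)$, the normalization used by the set $\Delta$ in (2.3.6).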
Saddle Point Formulation of Discounted MDP. We rewrite the LP (2.3.2) as an equivalent minimax problem:
$$\min_{v \in \mathbb{R}^{|S|}} \max_{\lambda \ge 0} \; L(v, \lambda) = \xi^T v + \sum_{a \in A} \lambda_a^T \left((\gamma P_a - I)\, v + r_a\right). \tag{2.3.4}$$
The primal variable $v$ is of dimension $|S|$, and the dual variable $\lambda = (\lambda_a)_{a \in A} = (\lambda_{a,i})_{a \in A,\, i \in S}$
is of dimension $|S| \cdot |A|$. Each subvector $\lambda_a \in \mathbb{R}^{|S|}$ is the vector multiplier corresponding
to the constraint inequalities $(I - \gamma P_a)\, v - r_a \ge 0$. Each entry $\lambda_{a,i}$ is the scalar
multiplier associated with the $i$-th row of $(I - \gamma P_a)\, v - r_a \ge 0$.
In order to develop an efficient algorithm, we modify the saddle point problem as follows:
$$\min_{v \in \mathbb{R}^{|S|}} \max_{\lambda \in \mathbb{R}^{|S \times A|}} \left\{ L(v, \lambda) = \xi^T v + \sum_{a \in A} \lambda_a^T \left((\gamma P_a - I)\, v + r_a\right) \right\}, \quad \text{subject to } v \in V,\ \lambda \in \Xi \cap \Delta, \tag{2.3.5}$$
where
$$V = \left\{ v \ \Big|\ v \ge 0,\ \|v\|_\infty \le \frac{\sigma}{1-\gamma} \right\}, \quad \Xi = \left\{ \lambda \ \Big|\ \sum_{a \in A} \lambda_{a,i} \ge \xi_i,\ \forall\, i \in S \right\}, \quad \Delta = \left\{ \lambda \ \Big|\ \lambda \ge 0,\ \|\lambda\|_1 = \frac{\|\xi\|_1}{1-\gamma} \right\}. \tag{2.3.6}$$
We will show later that v∗ and λ∗ belong to V and Ξ∩∆ respectively (Lemma A.1.1).
As a result, the modified saddle point problem (2.3.5) is equivalent to the original
problem (2.3.4).
2.3.2 Finite-State Finite-Horizon MDP
Consider the finite-horizon Markov decision process described by a tuple $M = (S, A, H, P, r)$ as in Section 2.2.2. The Bellman equation of the finite-horizon MDP is given by
$$v^*_h(i) = \max_{a \in A}\left\{ r_a(i) + \sum_{j \in S} P_a(i,j)\, v^*_{h+1}(j) \right\}, \quad \forall\, h \in [H],\ i \in S, \qquad v^*_H = 0, \tag{2.3.7}$$
where $P_a$ is the transition probability matrix of the fixed action $a$. The vector form
of the Bellman equation is
$$v^*_h = \max_{a \in A}\left\{ r_a + P_a v^*_{h+1} \right\}, \quad \forall\, h \in [H], \qquad v^*_H = 0,$$
where the maximization is carried out component-wise. A vector v∗ satisfies the
Bellman equation if and only if it is the optimal value function.
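When the model is known, the recursion (2.3.7) is ordinary backward induction; a minimal sketch (with hypothetical inputs `P` and `r` laid out as described above) is:

```python
import numpy as np

def solve_finite_horizon(P, r, H):
    """Backward induction for the Bellman equation (2.3.7).

    P[a] is the |S| x |S| transition matrix of action a and r[a] the
    expected reward vector r_a; returns (values, policies) per period.
    """
    m, n = len(P), P[0].shape[0]
    v = np.zeros((H + 1, n))                 # terminal condition v_H = 0
    pi = np.zeros((H, n), dtype=int)
    for h in range(H - 1, -1, -1):           # sweep backward over periods
        # Q[a, i] = r_a(i) + sum_j P_a(i, j) v_{h+1}(j)
        Q = np.stack([r[a] + P[a] @ v[h + 1] for a in range(m)])
        v[h] = Q.max(axis=0)                 # component-wise maximum
        pi[h] = Q.argmax(axis=0)
    return v[:H], pi
```

On a toy instance where one action always pays reward 1, this returns $v_h = (H-h)\mathbf{1}$, matching the recursion.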
The Bellman equation is equivalent to the following linear program:
$$\begin{array}{ll} \text{minimize} & \displaystyle\sum_{h=0}^{H-1} \xi_h^T v_h \\ \text{subject to} & v_h \ge P_a v_{h+1} + r_a, \quad a \in A,\ h \in [H], \\ & v_H = 0, \end{array} \tag{2.3.8}$$
where $\xi = (\xi_0, \ldots, \xi_{H-1})$ is an arbitrary vector with positive entries and $v = (v_0^T, \ldots, v_{H-1}^T)^T$ is the primal variable. The above linear program has $|S||A|H$
constraints and $|S|H$ variables. The dual linear program of (2.3.8) is given by
$$\begin{array}{ll} \text{maximize} & \displaystyle\sum_{h=0}^{H-1} \sum_{a \in A} \lambda_{h,a}^T r_a \\ \text{subject to} & \displaystyle\sum_{a \in A} \left(\lambda_{h,a} - P_a^T \lambda_{h-1,a}\right) = \xi_h, \quad h \in [H], \\ & \lambda_{-1,a} = 0, \quad \lambda_{h,a} \ge 0, \quad h \in [H],\ a \in A, \end{array} \tag{2.3.9}$$
where the dual variable $\lambda = (\lambda_{h,a})_{h \in [H],\, a \in A}$ is a vector of dimension $|S||A|H$. In the
remainder of the chapter, we use the notation $\lambda_h$ to denote the vector $\lambda_h = (\lambda_{h,a})_{a \in A}$
and $\lambda_a$ to denote the vector $\lambda_a = (\lambda_{h,a})_{h \in [H]}$. We denote by $\lambda^*$ the optimal dual
solution. Now we establish the value-policy duality for finite-horizon MDP.
Theorem 2.3.2 (Value-Policy Duality for Finite-Horizon MDP). Assume that the
finite-horizon MDP tuple $M = (S, A, H, P, r)$ has a unique optimal policy $\pi^* = (\pi^*_h)_{h \in [H]}$. The vector pair $(v^*, \lambda^*)$ is the unique pair of primal and dual solutions to
(2.3.8), (2.3.9) if and only if
$$v^* = (I - \Pi_{\pi^*})^{-1} (R_{h,\pi^*})_{h \in [H]}, \qquad \left(\lambda^*_{h,\pi^*_h(i)}(i)\right)_{h \in [H],\, i \in S} = (I - \Pi_{\pi^*}^T)^{-1} \xi, \qquad \lambda^*_{h,a}(i) = 0 \ \text{if} \ a \neq \pi^*_h(i),$$
where $\Pi_{\pi^*}$ is the transition probability matrix under the optimal policy $\pi^*$ and $R_{h,\pi^*}$ is the expected
transition reward under $\pi^*$, i.e., $R_{h,\pi^*}(i) = r_{\pi^*_h(i)}(i)$ for all $i \in S$ and $h \in [H]$.
Proof. The proof is based on the fundamental property of linear duality, i.e.,
$(v^*, \lambda^*)$ is the optimal pair of primal and dual solutions if and only if:

(a) $v^*$ is primal feasible, i.e., $v^*_h - P_a v^*_{h+1} - r_a \ge 0$, $\forall\, h \in [H],\ a \in A$.

(b) $\lambda^*$ is dual feasible, i.e., $\sum_{a \in A} \left(\lambda^*_{h,a} - P_a^T \lambda^*_{h-1,a}\right) = \xi_h$ and $\lambda^*_{h,a} \ge 0$, $\forall\, h \in [H],\ a \in A$.

(c) $(v^*, \lambda^*)$ satisfies the complementary slackness condition
$$\lambda^*_{h,a}(i) \cdot \left(v^*_h(i) - r_a(i) - P_{a,i} v^*_{h+1}\right) = 0 \quad \forall\, h \in [H],\ i \in S,\ a \in A,$$
where $\lambda^*_{h,a}(i)$ is the $i$-th element of $\lambda^*_{h,a}$ and $P_{a,i}$ is the $i$-th row of $P_a$.
Suppose that $(v^*, \lambda^*)$ is primal-dual optimal. As a result, it satisfies (a), (b), (c)
and $v^*$ is the optimal value vector. By the definition of the optimal value function,
we know that $v^*_h(i) - P_{\pi^*_h(i),i} v^*_{h+1} - r_{\pi^*_h(i)}(i) = 0$. Since $\pi^*$ is unique, we have $v^*_h(i) - P_{a,i} v^*_{h+1} - r_a(i) > 0$ if $a \neq \pi^*_h(i)$. As a result, we have $\lambda^*_{h,a}(i) = 0$ for all $a \neq \pi^*_h(i)$.
Together with the dual feasibility relation $\sum_{a \in A} \left(\lambda^*_{h,a} - P_a^T \lambda^*_{h-1,a}\right) = \xi_h$, we have
$$(I - \Pi_{\pi^*}^T) \left(\lambda^*_{h,\pi^*_h(i)}(i)\right)_{h \in [H],\, i \in S} = \xi,$$
where $\Pi_{\pi^*}$ is defined in Section 2.2.2. Note that $(I - \Pi_{\pi^*}^T)$ is invertible since its determinant equals one. Then we have $\left(\lambda^*_{h,\pi^*_h(i)}(i)\right)_{h \in [H],\, i \in S} = (I - \Pi_{\pi^*}^T)^{-1} \xi$, which
together with the complementary slackness condition implies that $\lambda^*$ is unique. Similarly, we
can show that $v^* = (I - \Pi_{\pi^*})^{-1} (R_{h,\pi^*})_{h \in [H]}$ from primal feasibility and the slackness
condition.
Now suppose that (v∗, λ∗) satisfies the three conditions stated in Theorem 2.3.2.
Then we obtain (a),(b),(c) directly, which proves that (v∗, λ∗) is primal-dual optimal.
□
Saddle Point Formulation of Finite-Horizon MDP. By Lagrangian duality, we can rewrite the linear program (2.3.8) as the following saddle
point problem:
$$\min_{v \in V} \max_{\lambda \in \Xi \cap \Delta} \; L(v, \lambda) = \sum_{h=0}^{H-1} \xi_h^T v_h + \sum_{h=0}^{H-1} \sum_{a \in A} \lambda_{h,a}^T \left(r_a + P_a v_{h+1} - v_h\right), \tag{2.3.10}$$
where
$$V = \left\{ v \ \Big|\ v_h \ge 0,\ \|v_h\|_\infty \le (H-h)\sigma,\ \forall\, h \in [H] \right\},$$
$$\Xi = \left\{ \lambda \ \Big|\ \sum_{a \in A} \lambda_{h,a} \ge \xi_h,\ \forall\, h \in [H] \right\}, \quad \Delta = \left\{ \lambda \ \Big|\ \lambda \ge 0,\ \|\lambda_h\|_1 = \sum_{\tau=0}^{h} \|\xi_\tau\|_1,\ \forall\, h \in [H] \right\}. \tag{2.3.11}$$
We will prove later that the optimal primal solution $v^*$ and the optimal dual solution
$\lambda^*$ satisfy the additional constraints $V$ and $\Xi \cap \Delta$ respectively (Lemma A.2.1). As
a result, $(v^*, \lambda^*)$ is a pair of primal and dual solutions to the saddle point problem
(2.3.10).
We now state the matrix form of the saddle point problem. Let $I$ be the identity
matrix of dimension $|S|H$ and let $\Pi_a \in \mathbb{R}^{|S|H \times |S|H}$ be the transition
probability matrix of the augmented Markov chain under the fixed action $a$, taking the form
$$\Pi_a = \begin{pmatrix} 0 & P_a & 0 & \cdots & 0 \\ 0 & 0 & P_a & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & P_a \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}.$$
Then the Lagrangian $L(v, \lambda)$ is equivalent to
$$L(v, \lambda) = \xi^T v + \sum_{a \in A} \lambda_a^T \left(R_a + (\Pi_a - I)\, v\right), \tag{2.3.12}$$
where $R_a = (r_a^T, \ldots, r_a^T)^T$ and $\lambda_a = (\lambda_{h,a})_{h \in [H]}$. The partial derivatives of the Lagrangian
are
$$\nabla_v L(v, \lambda) = \xi + \sum_{a \in A} (\Pi_a^T - I)\, \lambda_a, \qquad \nabla_{\lambda_a} L(v, \lambda) = R_a + (\Pi_a - I)\, v.$$
In what follows, we will exploit the linear structure of these partial derivatives to develop
stochastic primal-dual algorithms.
2.4 Stochastic Primal-Dual Methods for Markov Decision Problems
We are interested in developing algorithms that not only apply to explicitly given
MDP models but also apply to reinforcement learning. In particular, we focus on the
model-free learning setting, which is summarized as follows.

Model-Free Learning Setting of MDP

(i) The state space $S$, the action space $A$, the reward upper bound $\sigma$, and the discount factor $\gamma$ (or the horizon $H$) are known.

(ii) The transition probabilities $P$ and reward function $r$ are unknown.

(iii) There is a Sampling Oracle (SO) that takes input $(i, a)$ and generates a new state $j$ with probability $P_a(i,j)$ and a random reward $\hat{r}_{ija} \in [0, \sigma]$ with expectation $r_{ija}$.
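For experiments, the sampling oracle can be mocked in a few lines. The sketch below is an assumption-laden illustration: it stores the reward means in a hypothetical array `r[a, i, j]` and uses Bernoulli-scaled noise so that realized rewards stay in $[0, \sigma]$ while their expectations equal $r_{ija}$.

```python
import numpy as np

class SamplingOracle:
    """A mock SO for the model-free setting: sample(i, a) draws the next
    state j ~ P_a(i, .) and a random reward in [0, sigma] with mean
    r[a, i, j] (a hypothetical storage layout for the means r_ija)."""

    def __init__(self, P, r, sigma, rng=None):
        self.P, self.r, self.sigma = P, r, sigma
        self.rng = rng or np.random.default_rng(0)

    def sample(self, i, a):
        # Next state drawn from the (hidden) transition row P_a(i, .).
        j = self.rng.choice(self.P.shape[1], p=self.P[a, i])
        # Bernoulli-scaled noise: value sigma w.p. r/sigma, else 0, so
        # the realized reward lies in [0, sigma] with mean r[a, i, j].
        reward = self.sigma * self.rng.binomial(1, self.r[a, i, j] / self.sigma)
        return int(j), float(reward)
```

The learner only ever calls `sample`, matching item (iii): $P$ and $r$ stay hidden inside the oracle.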
Motivated by the value-policy duality (Theorems 2.3.1, 2.3.2), we develop a class of
stochastic primal-dual methods for the saddle point formulation of the Bellman equation.
In particular, we develop the Stochastic Primal-Dual method for Discounted Markov
Decision Process (SPD-dMDP), as given in Algorithm 2.1.
The algorithm keeps an estimate of the value function v and the dual iterates λ
in the minimax problem (2.3.4) and makes updates to them while sampling random
states and actions. At the k-th iteration, the algorithm samples a random state i and
a random action a. Then the algorithm uses the sampled state and action to compute
noisy gradients of the minimax problem (2.3.4) with respect to the value function v
and the dual iterates λ. After updating the value function and the dual iterates using
the gradients, the algorithm projects v and λ back to the corresponding constraints
V and Ξ ∩ ∆ in the optimization problem (2.3.5). Essentially, each iteration is a
stochastic primal-dual iteration for solving the saddle point problem.
We also develop the Stochastic Primal-Dual method for Finite-horizon Markov
Decision Process (SPD-fMDP), as given by Algorithm 2.2. The SPD algorithms
maintain a running estimate of the optimal value function (i.e., the primal solution)
and the optimal policy (i.e., the dual solution). They make simple updates to the
value and policy estimates as new state and reward observations are drawn from the
sampling oracle.
Implementation and Computational Complexities The proposed SPD algo-
rithms are essentially stochastic coordinate descent methods. They exhibit favorable
properties such as small space complexity and small computational complexity per
iteration.
The SPD-dMDP Algorithm 2.1 updates the value and policy estimates by processing one sampled state transition at a time. It keeps track of the dual variable $\lambda$ and
the primal variable $v$, which requires $|A| \times |S| + |S| = O(|A| \times |S|)$ space. In comparison,
the dimension of the discounted MDP is $|A| \times |S|^2$. Thus the space complexity of Algorithm 2.1 is sublinear with respect to the problem size. Moreover, the SPD-dMDP
Algorithm 2.1 is a coordinate descent method. It updates two coordinates of the
value estimate and one coordinate of the policy estimate per iteration. After
the dual update, the algorithm first projects the dual iterates onto the simplex $\Delta$ and
then projects them onto the box constraint $\Xi$. The projection onto the
simplex can be implemented using $O(|S| \times |A|)$ arithmetic operations [25]. The projection onto $\Xi$ under
the simplex constraint can be implemented using $O(|S| \times |A|)$ arithmetic operations
as well.¹ The overall computation complexity per iteration is $O(|S| \times |A|)$ arith-

¹Consider the projection of $x \in \Delta$ onto $\Xi \cap \Delta$. We can write the problem as $\min_{y \in \Delta} \sum_a \sum_i (y_{a,i} - x_{a,i})^2$, s.t. $\sum_a y_{a,j} \ge c$, $\forall j$, where $c$ is a constant. The derivative of the objective with respect to $y_{a,i}$ is $2(y_{a,i} - x_{a,i})$. Denote by $y^*$ the optimal solution. For two variables $y^*_{a_1,i_1}$, $y^*_{a_2,i_2}$, the derivatives of the objective with respect to them, $2(y^*_{a_1,i_1} - x_{a_1,i_1})$ and $2(y^*_{a_2,i_2} - x_{a_2,i_2})$, must equal each other
Algorithm 2.1 Stochastic Primal-Dual Algorithm for Discounted MDP (SPD-dMDP)

Input: Sampling Oracle SO, $n = |S|$, $m = |A|$, $\gamma \in (0,1)$, $\sigma \in (0,\infty)$
Initialize $v^{(0)} : S \mapsto \left[0, \frac{\sigma}{1-\gamma}\right]$ and $\lambda^{(0)} : S \times A \mapsto \left[0, \frac{\|\xi\|_1 \sigma}{1-\gamma}\right]$ arbitrarily.
Set $\xi = \frac{\sigma}{\sqrt{n}}\, e$
for $k = 1, 2, \ldots, T$ do
    Sample $i$ uniformly from $S$
    Sample $a$ uniformly from $A$
    Sample $j$ and $\hat{r}_{ija}$ conditioned on $(i, a)$ from SO
    Set $\beta = \sqrt{n/k}$
    Update the primal iterates by
        $v^{(k)}(i) \leftarrow \max\left\{\min\left\{v^{(k-1)}(i) - \beta\left(\frac{1}{m}\xi(i) - \lambda_a^{(k-1)}(i)\right), \frac{\sigma}{1-\gamma}\right\}, 0\right\}$
        $v^{(k)}(j) \leftarrow \max\left\{\min\left\{v^{(k-1)}(j) - \gamma\beta\lambda_a^{(k-1)}(i), \frac{\sigma}{1-\gamma}\right\}, 0\right\}$
        $v^{(k)}(s) \leftarrow v^{(k-1)}(s)$, $\forall\, s \neq i, j$
    Update the dual iterates by
        $\lambda_a^{(k-\frac{1}{2})}(i) \leftarrow \lambda_a^{(k-1)}(i) + \beta\left(\gamma v^{(k-1)}(j) - v^{(k-1)}(i) + \hat{r}_{ija}\right)$
        $\lambda^{(k-\frac{1}{2})}(a', i') \leftarrow \lambda^{(k-1)}(a', i')$, $\forall\, (a', i')$ such that $a' \neq a$ or $i' \neq i$
    Project the dual iterates by
        $\lambda^{(k)} \leftarrow \Pi_\Xi \Pi_\Delta\, \lambda^{(k-\frac{1}{2})}$, where $\Xi$ and $\Delta$ are given by (2.3.6)
end for
Output: Averaged dual iterate $\hat{\lambda} = \frac{1}{T}\sum_{k=1}^{T} \lambda^{(k)}$ and randomized policy $\hat{\pi}$ where $\mathbb{P}(\hat{\pi}(i) = a) = \frac{\hat{\lambda}_a(i)}{\sum_{a \in A} \hat{\lambda}_a(i)}$
metic operations. Therefore SPD Algorithm 2.1 uses sublinear space complexity and
sublinear computation complexity per iteration.
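One iteration of Algorithm 2.1 can be sketched in a few lines of NumPy. This is a simplified illustration, not the analyzed method: the `oracle(i, a)` interface and the projection step (plain clipping followed by rescaling onto $\Delta$, rather than the exact $\Pi_\Xi \Pi_\Delta$ projection) are assumptions of the sketch.

```python
import numpy as np

def spd_dmdp_step(v, lam, k, oracle, gamma, sigma, xi, rng):
    """One iteration of SPD-dMDP (a sketch of Algorithm 2.1).

    oracle(i, a) is assumed to return (j, reward). The exact
    Pi_Xi Pi_Delta projection is simplified here to clipping at zero
    plus rescaling onto Delta = {lam >= 0, ||lam||_1 = ||xi||_1/(1-gamma)}.
    """
    n, m = v.size, lam.shape[0]
    i, a = int(rng.integers(n)), int(rng.integers(m))
    j, reward = oracle(i, a)
    beta = np.sqrt(n / k)
    vmax = sigma / (1.0 - gamma)
    # Noisy temporal difference, computed from the previous primal iterate.
    td = gamma * v[j] - v[i] + reward
    # Primal update on the two sampled coordinates, clipped to [0, vmax].
    v[i] = np.clip(v[i] - beta * (xi[i] / m - lam[a, i]), 0.0, vmax)
    v[j] = np.clip(v[j] - gamma * beta * lam[a, i], 0.0, vmax)
    # Dual update on the sampled (a, i) coordinate.
    lam[a, i] += beta * td
    # Simplified projection of the dual iterate.
    lam = np.maximum(lam, 0.0)
    lam *= xi.sum() / ((1.0 - gamma) * max(lam.sum(), 1e-12))
    return v, lam
```

Averaging the dual iterates over $k$ and normalizing per state then yields the randomized policy, per the output step of the algorithm.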
if $\sum_a y^*_{a,i_1} > c$ and $\sum_a y^*_{a,i_2} > c$. Otherwise, we could move mass between $y^*_{a_1,i_1}$ and $y^*_{a_2,i_2}$ to obtain a smaller objective value. Thus we can compute $y^*$ explicitly as follows. For each $i$ such that $\sum_a x_{a,i} < c$, we set $y^*_{\cdot,i}$ so that $\sum_a y^*_{a,i} = c$. For the remaining variables $y^*_{a,i}$, we set the value of $y^*_{a,i}$ to $x_{a,i}$ plus a constant shift so that $\sum_a \sum_i y^*_{a,i}$ is 1. Note that this projection algorithm also works for other distance metrics such as the KL divergence $\sum_{a,i} y_{a,i} \log(y_{a,i}/x_{a,i})$.

Algorithm 2.2 has a similar spirit to Algorithm 2.1. It keeps track of the
dual variable $(\lambda_{h,a})_{h \in [H],\, a \in A}$ (randomized policies for all periods) and the primal
variable $v_0, \ldots, v_{H-1}$ (value functions for all periods). Algorithm 2.2 is specially designed
for $H$-period MDP in two aspects. It uses a non-uniform weight vector with $\xi_0 = \frac{e}{H}$
and $\xi_h = \frac{e}{(H-h)(H-h+1)}$, which places more weight on later periods to balance their
smaller values. It uses larger stepsizes to update policies associated with later periods more aggressively, while using smaller stepsizes to update earlier-period policies
more conservatively. The space complexity is $O(|S| \times |A| \times H)$, which is sublinear
with respect to the problem dimension $O(|S|^2 \times |A| \times H)$. Algorithm 2.2 is essentially
a coordinate descent method that involves projections onto simple sets. The computation complexity per iteration is $O(|S| \times |A| \times H)$, which is mainly due to the
projection step.
Comparisons with Existing Methods Our SPD algorithms are fundamentally
different from the existing reinforcement learning methods. From a theoretical per-
spective, our SPD algorithms are based on the stochastic saddle point formulation of
the Bellman equation. To the authors’ best knowledge, this idea has not been used
in any existing method. From a practical perspective, the SPD methods are easy to
implement and have small space and computational complexity (one of the smallest
compared to existing methods). In what follows, we compare the newly proposed
SPD methods with several popular existing methods.
• The new SPD methods share a similar spirit with the Q-learning and delayed
Q-learning methods. Both of them maintain and update a value for each state-
action pair (i, a). Delayed Q-learning maintains estimates of the value function
at each state-action pair (i, a), i.e., the Q values. Our SPD maintains estimates
of probabilities for choosing each (i, a), i.e., the randomized policy. In both
cases, the values associated with state-action pairs are used to determine how
to choose the actions. Delayed Q-learning uses O(|S||A|) space and O(ln |A|)
Algorithm 2.2 Stochastic Primal-Dual Algorithm for Finite-horizon MDP (SPD-fMDP)

Input: Sampling Oracle SO, $n = |S|$, $m = |A|$, $H$, $\sigma \in (0,\infty)$
Initialize $v_h : S \mapsto [0, (H-h)\sigma]$ and $\lambda_h : S \times A \mapsto \left[0, \frac{n}{H-h}\right]$, $\forall\, h \in [H]$ arbitrarily
Set $\xi_0 = \frac{e}{H}$ and $\xi_h = \frac{e}{(H-h)(H-h+1)}$, $\forall\, h \neq 0$
for $k = 1, 2, \ldots, T$ do
    Sample $i$ uniformly from $S$
    Sample $a$ uniformly from $A$
    Sample $j$ and $\hat{r}_{ija}$ conditioned on $(i, a)$ from SO
    Update the primal iterates by
        $v_h^{(k)}(i) \leftarrow \max\left\{\min\left\{v_h^{(k-1)}(i) - \frac{(H-h)^2\sigma}{\sqrt{k}}\left(\xi_h(i) - m\lambda_{h,a}^{(k-1)}(i)\right), (H-h)\sigma\right\}, 0\right\}$, $\forall\, h \in [H]$
        $v_h^{(k)}(j) \leftarrow \max\left\{\min\left\{v_h^{(k-1)}(j) - \frac{m(H-h)^2\sigma}{\sqrt{k}}\lambda_{h-1,a}^{(k-1)}(i), (H-h)\sigma\right\}, 0\right\}$, $\forall\, h \in [H]$
        $v_h^{(k)}(s) \leftarrow v_h^{(k-1)}(s)$, $\forall\, h \in [H]$, $s \neq i, j$
    Update the dual iterates by
        $\lambda_{h,a}^{(k-\frac{1}{2})}(i) \leftarrow \lambda_{h,a}^{(k-1)}(i) + \frac{n}{(H-h)^2\sigma\sqrt{k}}\left(v_{h+1}^{(k-1)}(j) - v_h^{(k-1)}(i) + \hat{r}_{ija}\right)$, $\forall\, h \in [H]$
        $\lambda_{h,a'}^{(k-\frac{1}{2})}(i') \leftarrow \lambda_{h,a'}^{(k-1)}(i')$, $\forall\, h \in [H]$, $\forall\, (a', i')$ such that $a' \neq a$ or $i' \neq i$
    Project the dual iterates by
        $\lambda^{(k)} \leftarrow \Pi_\Xi \Pi_\Delta\, \lambda^{(k-\frac{1}{2})}$, where $\Xi$ and $\Delta$ are given by (2.3.11)
end for
Output: Averaged dual iterate $\hat{\lambda} = \frac{1}{T}\sum_{k=1}^{T} \lambda^{(k)}$ and randomized policy $\hat{\pi} = (\hat{\pi}_0, \ldots, \hat{\pi}_{H-1})$ where $\mathbb{P}(\hat{\pi}_h(i) = a) = \frac{\hat{\lambda}_{h,a}(i)}{\sum_{a \in A} \hat{\lambda}_{h,a}(i)}$
arithmetic operations per iteration [99]. Our SPD methods enjoy similar com-
putational advantages as the delayed Q-learning method.
• The SPD methods are also related to the class of actor-critic methods. Our
dual variable mimics the actor, while the primal variable mimics the critic.
In particular, the dual update step in SPD turns out to be very similar to
the actor update: both updates use a noisy temporal difference. Actor-critic
methods are two-timescale methods in which the actor updates on a faster scale
in comparison to the critic. In contrast, the new SPD methods have only one
timescale: the primal and dual variables are updated using a single sequence of
stepsizes. As a result, SPD methods are more efficient in utilizing new data as
they emerge and achieve an $O(1/\sqrt{T})$ rate of convergence.
• Upper confidence methods maintain and update a value or interval for each
state-action-state triplet; see the works by Lattimore and Hutter [66], Dann
and Brunskill [34]. These methods use space up to O(|S|2|A|), which is linear
with respect to the size of the MDP model. In contrast, SPD does not estimate
transition probabilities of the unknown MDP and uses only O(|S||A|) space.
To sum up, the main advantage of the SPD methods is the small storage and small
computational complexity per iteration. We note that the main computational com-
plexity of SPD is due to the projection of dual variables.
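Since the projection dominates the per-iteration cost, it is worth seeing concretely. The $\Pi_\Delta$ step is a Euclidean projection onto a scaled simplex (with total mass $z = \|\xi\|_1/(1-\gamma)$ in the discounted case). Below is a standard sort-based sketch, $O(d \log d)$ rather than the linear-time routine cited as [25]; the exact $\Pi_\Xi$ adjustment described in the footnote is a separate step not shown here.

```python
import numpy as np

def project_simplex(x, z=1.0):
    """Euclidean projection of x onto {y >= 0, sum(y) = z}, the Pi_Delta
    step of the SPD methods. Sort-based variant: O(d log d) instead of
    the O(d) median-finding routine of [25]."""
    u = np.sort(x)[::-1]                      # sort descending
    css = np.cumsum(u) - z
    # Largest index rho with u[rho] > (cumsum(u)[rho] - z) / (rho + 1).
    rho = np.nonzero(u * np.arange(1, x.size + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)            # uniform shift
    return np.maximum(x - theta, 0.0)
```

Points already on the simplex are left unchanged (the shift $\theta$ comes out zero), and any other point is shifted uniformly and clipped at zero.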
2.5 Main Results
In this section, we study the convergence of the two SPD methods: the SPD-dMDP
Algorithm 2.1 and the SPD-fMDP Algorithm 2.2. Our main results show that the
SPD methods output a randomized policy that is absolute-$\epsilon$-optimal using finite samples with high probability. We analyze the cases of discounted MDP and finite-horizon
MDP separately. All the proofs of the theorems are deferred to the appendix.
2.5.1 Sample Complexity Analysis of Discounted-Reward
MDPs
We analyze the SPD-dMDP Algorithm 2.1 as a stochastic analog of a deterministic
primal-dual iteration. We show that the duality gap associated with the primal and
dual iterates decreases to zero with the following guarantee.
Theorem 2.5.1 (PAC Duality Gap). For any $\epsilon > 0$, $\delta \in (0,1)$, let $\hat{\lambda} = (\hat{\lambda}_a)_{a \in A} \in \mathbb{R}^{|S \times A|}$ be the averaged dual iterate generated by the SPD-dMDP Algorithm 2.1 using
the following sample size/iteration number:
$$\Omega\left(\frac{|S|^3|A|^2\sigma^4}{(1-\gamma)^4\epsilon^2}\ln\left(\frac{1}{\delta}\right)\right).$$
Then the dual iterate $\hat{\lambda}$ satisfies
$$\sum_{a \in A} (\hat{\lambda}_a)^T \left(v^* - \gamma P_a v^* - r_a\right) \le \epsilon$$
with probability at least $1 - \delta$.
Proof Outline. The SPD-dMDP Algorithm 2.1 can be viewed as a stochastic approximation scheme for the saddle point problem (2.3.5). Upon drawing a triplet $(i_k, a_k, j_k)$,
we obtain noisy samples of the partial derivatives $\nabla_v L(v^{(k)}, \lambda^{(k)})$ and $\nabla_\lambda L(v^{(k)}, \lambda^{(k)})$. The
SPD-dMDP Algorithm 2.1 is equivalent to
$$v^{k+1} = \Pi_V\left[v^k - \beta_k\left(\nabla_v L(v^{(k)}, \lambda^{(k)}) + \epsilon_k\right)\right],$$
$$\lambda^{k+1} = \Pi_\Xi \Pi_\Delta\left[\lambda^k + \beta_k\left(\nabla_\lambda L(v^{(k)}, \lambda^{(k)}) + \varepsilon_k\right)\right],$$
where $V$, $\Xi$ and $\Delta$ are specially constructed sets, $\epsilon_k, \varepsilon_k$ are zero-mean noise
vectors, and $\beta_k$ is the stepsize. By analyzing the distance between $(v^k, \lambda^k)$ and $(v^*, \lambda^*)$, we
obtain that the squared distance decreases by factors of the duality gap per iteration.
Then we construct a martingale based on the sequence of duality gaps and apply
Bernstein's inequality. The formal proof is deferred to the appendix.
Note that the dual variable is always nonnegative, $\hat{\lambda} \ge 0$, by the projection onto the
simplex $\Delta$. Also note that the nonnegative vector $v^* - (r_a + \gamma P_a v^*) \ge 0$ collects
the primal constraint slacks attained by the primal optimal solution $v^*$. Theorem 2.5.1 essentially gives an error bound for entries of $\hat{\lambda}$ corresponding to inactive primal
row constraints.
Now we are ready to present the sample complexity of SPD for discounted MDP.
Theorem 2.5.2 shows that the averaged dual iterate λ̂ gives a randomized policy that
approximates the optimal policy π∗. The performance of the randomized policy can
be analyzed using the diminishing duality gap from Theorem 2.5.1.
Theorem 2.5.2 (PAC Sample Complexity). For any $\epsilon > 0$, $\delta \in (0,1)$, let the SPD-dMDP Algorithm 2.1 iterate with the following sample size/iteration number:
$$\Omega\left(\frac{|S|^4|A|^2\sigma^2}{(1-\gamma)^6\epsilon^2}\ln\left(\frac{1}{\delta}\right)\right).$$
Then the output policy $\hat{\pi}$ is absolute-$\epsilon$-optimal with probability at least $1 - \delta$.
Next, we consider how to recover the optimal policy π∗ from the dual iterates λ̂
generated by the SPD-dMDP Algorithm 2.1. Note that the policy space is discrete,
which makes it possible to distinguish the optimal one from others when the estimated
policy π̂ is close enough to the optimal one.
Definition 2.5.1. Let the minimal action discrimination constant $\bar{d}$ be the minimal
efficiency loss of deviating from the optimal policy $\pi^*$ by making a single wrong action:
$$\bar{d} = \min_{(i,a):\, \pi^*(i) \neq a} \left(v^*(i) - \gamma P_{a,i} v^* - r_a(i)\right).$$
As long as there exists a unique optimal policy π∗, we have d̄ > 0. A large value
of d̄ means that it is easy to discriminate optimal actions from suboptimal actions.
A small value of d̄ means that some suboptimal actions perform similarly to optimal
actions.
Theorem 2.5.3 (Exact Recovery of the Optimal Policy). For any $\delta \in (0,1)$,
let the SPD-dMDP Algorithm 2.1 iterate with the following sample size:
$$\Omega\left(\frac{|S|^4|A|^4\sigma^2}{\bar{d}^2(1-\gamma)^4}\ln\left(\frac{1}{\delta}\right)\right).$$
Let $\hat{\pi}^{Tr}$ be obtained by rounding the randomized policy $\hat{\pi}$ to the nearest deterministic
policy, given by
$$\hat{\pi}^{Tr}(i) = \operatorname{argmax}_{a \in A} \hat{\lambda}_{a,i}, \quad i \in S.$$
Then $\mathbb{P}\left(\hat{\pi}^{Tr} = \pi^*\right) \ge 1 - \delta$.
To the best of our knowledge, this is the first result that shows how to recover the exact
optimal policy in reinforcement learning. The discrete nature of the policy space
makes exact recovery possible.
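The output and rounding steps of Theorems 2.5.2 and 2.5.3 reduce to per-state normalization and an argmax over the dual mass. A small sketch, with an illustrative actions-by-states layout for $\hat{\lambda}$:

```python
import numpy as np

def randomized_policy(lam_hat):
    """P(pi_hat(i) = a) = lam_hat[a, i] / sum_a lam_hat[a, i],
    the randomized policy output by the SPD algorithms."""
    return lam_hat / lam_hat.sum(axis=0, keepdims=True)

def round_policy(lam_hat):
    """Theorem 2.5.3 rounding: the action carrying the largest dual
    mass in each state gives the nearest deterministic policy."""
    return lam_hat.argmax(axis=0)
```

Once the duality gap is below $\bar{d}$, the argmax per state coincides with $\pi^*$, which is the content of the exact-recovery theorem.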
2.5.2 Sample Complexity Analysis of Finite-Horizon MDPs
Now we analyze the SPD-fMDP Algorithm 2.2. Again we start with the duality gap
analysis. We have the following theorem.
Theorem 2.5.4 (PAC Duality Gap). For any $\epsilon > 0$, $\delta \in (0,1)$, let $\hat{\lambda} = (\hat{\lambda}_{h,a})_{h \in [H],\, a \in A}$ be the averaged dual iterate generated by the SPD-fMDP Algorithm 2.2 using
the following sample size/iteration number:
$$\Omega\left(\frac{|S|^4|A|^2H^2\sigma^2}{\epsilon^2}\ln\left(\frac{1}{\delta}\right)\right).$$
Then the dual iterate $\hat{\lambda}$ satisfies
$$\sum_{h=0}^{H-1}\sum_{a \in A} (\hat{\lambda}_{h,a})^T \left(v^*_h - P_a v^*_{h+1} - r_a\right) \le \epsilon$$
with probability at least $1 - \delta$.
Proof Outline. We can view the SPD-fMDP Algorithm 2.2 as a stochastic approximation scheme for the saddle point problem (2.3.10). The SPD-fMDP Algorithm 2.2
is equivalent to the following iteration:
$$v_h^{k+1} = \Pi_V\left[v_h^k - \frac{(H-h)^2\sigma\gamma_k}{n}\left(\nabla_{v_h} L(v^{(k)}, \lambda^{(k)}) + \epsilon_k\right)\right],$$
$$\lambda_h^{k+1} = \Pi_\Xi \Pi_\Delta\left[\lambda_h^k + \frac{\gamma_k}{m(H-h)^2\sigma}\left(\nabla_{\lambda_h} L(v^{(k)}, \lambda^{(k)}) + \varepsilon_k\right)\right],$$
where $\Xi$ and $\Delta$ are given by (2.3.11), $\epsilon_k$ and $\varepsilon_k$ are zero-mean noise vectors, and $\gamma_k$ is the
stepsize. The formal proof is deferred to Section 6.
Next we present the sample complexity for the SPD-fMDP Algorithm 2.2. The
analysis is obtained from the duality gap.

Theorem 2.5.5 (PAC Sample Complexity). For any $\epsilon > 0$, $\delta \in (0,1)$, let the SPD-fMDP Algorithm 2.2 iterate with the following sample size:
$$\Omega\left(\frac{|S|^4|A|^2H^6\sigma^2}{\epsilon^2}\ln\left(\frac{1}{\delta}\right)\right).$$
Then the output policy $\hat{\pi}$ is absolute-$\epsilon$-optimal with probability at least $1 - \delta$.
Next we consider how to recover the optimal policy $\pi^*$ from the dual iterates $\hat{\lambda}$
generated by the SPD-fMDP Algorithm 2.2. We abuse notation and use $\bar{d}$ to denote the
minimal action discrimination constant for finite-horizon MDP.

Definition 2.5.2. Let the minimal action discrimination constant $\bar{d}$ be the minimal
efficiency loss of deviating from the optimal policy $\pi^*$ by making a single wrong action:
$$\bar{d} = \min_{(h,i,a):\, \pi^*_h(i) \neq a} \left(v^*_h(i) - P_{a,i} v^*_{h+1} - r_a(i)\right).$$
As long as there exists a unique optimal policy π∗, we have d̄ > 0. Now we state
our last theorem.
Theorem 2.5.6 (Exact Recovery of the Optimal Policy). For any $\delta \in (0,1)$,
let $\hat{\pi}^{Tr}$ be the truncated pure policy such that
$$\hat{\pi}^{Tr}_h(i) = \operatorname{argmax}_{a \in A} \hat{\lambda}_{h,a}(i), \quad i \in S_h.$$
Let the SPD-fMDP Algorithm 2.2 iterate with the following iteration number/sample
size:
$$\Omega\left(\frac{|S|^4|A|^4H^6\sigma^2}{\bar{d}^2}\ln\left(\frac{1}{\delta}\right)\right).$$
Then $\mathbb{P}\left(\hat{\pi}^{Tr} = \pi^*\right) \ge 1 - \delta$.

The results of Theorems 2.5.4, 2.5.5 and 2.5.6 for finite-horizon MDP are analogous
to Theorems 2.5.1, 2.5.2 and 2.5.3 for discounted-reward MDP. The horizon $H$ plays
a role similar to the discounted infinite sum $\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$.
2.6 Related Works
Our proposed methods use ideas from the linear program approach for Bellman’s
equations and the stochastic approximation method. The linear program formulation
of Bellman’s equation was known at around the same time when Bellman’s equation
was known; see [10, 90]. Ye [116] shows that the policy iteration of discounted MDP
is a form of the dual simplex method, which is strongly polynomial for the equivalent
linear program and terminates in run time O( |A×S|2
1−γ ). Cogill [28] considered the
exact primal-dual method for MDP with full knowledge and interpreted it as a form
of value iteration. Approximate linear programming approaches have been developed
for solving large-scale MDPs on a low-dimensional subspace, starting with de Farias
and Van Roy [36] and followed by Veatch [106] and Abbasi-Yadkori et al. [2].
Our algorithm and analysis are closely related to the class of stochastic approxi-
mation (SA) methods. For textbook references on stochastic approximation, please
see [63, 9, 16, 12]. We also use the averaging idea by [89]. In particular, our algorithm
can be viewed as a stochastic approximation method for stochastic saddle point prob-
lems, which was first studied by Nemirovski and Rubinstein [84] without the rate of
convergence analysis. Recent works Dang and Lan [33] and Chen et al. [24] studied
first-order stochastic methods for a class of general stochastic convex-concave saddle
point problems and obtained optimal and near-optimal convergence rates.
In the literature of reinforcement learning, there have been works on dual temporal
difference learning which use primal-dual-type methods; see for example [112, 72, 75,
76]. These works focused on evaluating the value function for a fixed policy. This is
different from our work, where the aim is to learn the optimal policy. We also remark
that a primal-dual learning method has been considered for the optimal stopping
problem by Borkar et al. [17]. However, no explicit sample complexity analysis is
available.
In recent years, a growing body of work has provided various rein-
forcement learning methods that are able to “learn” the optimal policy with sample
complexity guarantees. The notion of “Probably Approximately Correct” (PAC) was
considered for MDP by Strehl et al. [98], which requires that the learning method
outputs an $\epsilon$-optimal policy, with high probability, using a sample size polynomial in the parameters of the
MDP and $1/\epsilon$. Since then, many methods have been developed
for discounted MDP and proved to achieve various PAC guarantees. Strehl et al. [98]
showed that R-MAX has sample complexity O(S²A/(ε³(1 − γ)⁶)) and Delayed
Q-learning has O(SA/(ε⁴(1 − γ)⁸)). Lattimore and Hutter [66] proposed the Upper
Confidence Reinforcement Learning algorithm and obtained matching PAC upper and lower
bounds O(SA/(ε²(1 − γ)³) log(1/δ)) under a restrictive assumption: one can only move to two states
from any given state. Lattimore et al. [67] extended the analysis to more general
RL models. Azar et al. [6] showed that model-based value iteration achieves the
optimal rate O(|S × A| log(|S × A|/δ)/((1 − γ)³ε²)) for discounted MDP. Dann and Brunskill [34]
developed an upper confidence method for fixed-horizon MDP and obtained a near-optimal
rate O((S²AH²/ε²) ln(1/δ)). They also provided a lower bound Ω((SAH²/ε²) ln(1/(δ + c))). Based on their
PAC bounds, Osband and Van Roy [86] conjectured that the regret lower bound for
reinforcement learning is Ω(√(SAT)). Although the above confidence methods achieve
the close-to-optimal PAC complexity in some cases, they require maintaining a
confidence interval for each state-action-state triplet. Thus these methods are not yet
satisfactory in terms of space complexity. It remains unclear whether there is an approach
that achieves both space efficiency and a near-optimal sample complexity guarantee,
without estimating the transition probabilities. This motivates the research in
this chapter.
We emphasize that the SPD methods proposed in this chapter differ fundamentally
from the existing methods mentioned above. They are more closely related to first-order
stochastic methods for convex optimization and convex-concave saddle point problems.
The closest prior work is that of Wang and Chen [109], which proposed the first
primal-dual-type learning method but established only a loose error bound. No PAC
analysis was given in [109]. In this chapter,
we develop a new class of stochastic primal-dual methods, which are substantially
improved in both practical efficiency and theoretical complexity. Practically, the new
algorithms are essentially coordinate descent algorithms involving projections onto
simple sets. As a result, each iteration is straightforward to implement, making the
algorithms practically favorable. Theoretically, these methods come with rigorous
sample complexity guarantees. The results of this chapter provide the first PAC
guarantee for primal-dual reinforcement learning.
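To make the coordinate-descent flavor of these updates concrete, the following Python sketch shows one generic stochastic primal-dual step on a tabular MDP, with a Euclidean projection onto the probability simplex. The step sizes `alpha`, `beta` and the exact update directions are illustrative assumptions of ours, not the precise SPD-dMDP updates of Algorithms 2.1 and 2.2.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of a vector x onto the probability simplex."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1) / (np.arange(len(x)) + 1) > 0)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(x - theta, 0.0)

def spd_step(v, mu, s, a, s_next, r, gamma, alpha, beta):
    """One illustrative stochastic primal-dual update: v is the primal
    value estimate, mu a dual state-action weight on the simplex, and
    (s, a, s_next, r) a sampled transition."""
    # Sampled Bellman residual at (s, a).
    delta = r + gamma * v[s_next] - v[s]
    # Dual ascent on mu, then projection back onto the simplex.
    mu = mu.copy()
    mu[s, a] += beta * delta
    mu = project_simplex(mu.ravel()).reshape(mu.shape)
    # Primal descent on v along the sampled gradient of the Lagrangian.
    v = v.copy()
    v[s] += alpha * mu[s, a]
    v[s_next] -= alpha * gamma * mu[s, a]
    return v, mu
```

Each iteration touches only the sampled coordinates of `v` and `mu` plus one simplex projection, which is the sense in which such methods are "coordinate descent algorithms involving projections onto simple sets."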
2.7 Summary
We have presented a novel primal-dual approach for solving the MDP in the model-free
learning setting. A significant practical advantage of primal-dual learning methods
is their small storage requirement and computational efficiency. The SPD methods use
O(|S| × |A|) space and O(|S| × |A|) arithmetic operations per iteration, which is sublinear
with respect to the dimensions of the MDP. We show that the SPD methods output an
absolute-ε-optimal solution using O(|S|⁴|A|²/ε²) samples. In comparison, it is known that the sample
complexity of reinforcement learning is bounded from below by Ω(|S||A|/ε²) in a slightly
different setting [98, 34]. Clearly, our sample complexity results do not yet match the
lower bounds. However, these results represent our first steps in the study of rein-
forcement learning using the primal-dual approach. We will improve our algorithms
in Chapter 3 using a better update scheme on the dual variables, and show that the
new algorithms match the theoretical lower bounds on sample complexity.
We make several remarks about potential improvement and extensions of the
primal-dual learning methods.
• The SPD-dMDP Algorithms 2.1 and 2.2 require that the state-action pair is
sampled uniformly from S × A. In other words, Algorithms 2.1 and 2.2 use
pure exploration without any exploitation. Such a sampling oracle is suitable
for offline learning when a fixed-size static data set is given. In the online
learning setting, the sample complexity can be improved when actions are
sampled according to the latest value and policy estimates rather than uniformly.
• Another extension is to consider the average-reward MDP. In the average-reward
case, the discount factor γ and the horizon H disappear from the sample
complexity. One cannot derive the sample complexity for average-reward
MDP from the current result. We will study this problem in Chapter 3
and give algorithms that achieve theoretically optimal bounds.
We believe that the primal-dual approach studied in this chapter has significant
theoretical and practical potential. The bilinear stochastic saddle point formulation
of Bellman equations is amenable to online learning and dimension reduction. The
intrinsic linear duality between the optimal policy and value functions implies a con-
venient structure for efficient learning.
Chapter 3
Primal-Dual π Learning Using
State and Action Features
3.1 Introduction
Reinforcement learning lies at the intersection between control, machine learning,
and stochastic processes [14, 100]. The objective is to learn an optimal policy of a
controlled system from interaction data. The most studied model for a controlled
stochastic system is the Markov decision process (MDP), i.e., a controlled random
walk over a (possibly continuous) state space S, where in each state s ∈ S one can
choose an action a from an action space A so that the random walk transitions to
another state s′ ∈ S with density p(s, a, s′). In this chapter, we do not assume the
MDP model is explicitly known but consider the setting where a generative model is
given (see, e.g., [7]). In other words, there is an oracle that takes (s, a) as input and
outputs a random s′ with density p(s, a, s′) and an immediate reward r(s, a) ∈ [0, 1].
This is also known as a simulator-defined MDP in the literature [40, 102]. Our
goal is to find an optimal policy that, when running on the MDP to generate an
infinitely long trajectory, yields the highest average per-step reward in the limit or
the highest accumulative discounted reward.
Here, we focus on problems where the state and action spaces S and A are too large
to be enumerated. In practice, it might be computationally challenging even to store a
single state of the process (e.g., states could be high-resolution images). Suppose that
we are given a collection of state-action feature functions φ : S × A → Rᵐ and value
feature functions ψ : S → Rⁿ. They map each state-action pair (s, a) and state s ∈ S
to column vectors φ(s, a) = (φ₁(s, a), . . . , φₘ(s, a))ᵀ and ψ(s) = (ψ₁(s), . . . , ψₙ(s))ᵀ,
respectively, where m and n are much smaller than the sizes of S and A.
Our primary interest is to develop a sample-efficient and computationally scalable
algorithm, which takes advantage of the given features to solve an MDP with very
large or continuous state space and huge action space. Given the feature maps, φ
and ψ, we adopt linear models for approximating both the value function and the
stationary state-action density function of the MDP. By doing so, we can represent
the value functions and state-action density functions, which are high-dimensional
quantities, using a much smaller number of parameters.
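To illustrate this compact parameterization, the sketch below uses hypothetical feature maps on a toy state space S = [0, 1] with two actions. The specific choices of `phi`, `psi`, m = 4, and n = 3 are our own illustrative assumptions, not prescribed by the chapter.

```python
import numpy as np

def phi(s, a):
    """Hypothetical state-action features phi(s, a) in R^m, m = 4."""
    return np.array([1.0, s, float(a), s * float(a)])

def psi(s):
    """Hypothetical value features psi(s) in R^n, n = 3."""
    return np.array([1.0, s, s * s])

# Linear models: a value function v(s) = psi(s)^T w needs only n
# parameters, and a dual weight over phi needs only m, however large
# (or continuous) the underlying state space is.
w = np.zeros(3)      # n parameters instead of one value per state
theta = np.zeros(4)  # m parameters instead of one per state-action pair

def v(s):
    """Linearly parameterized value function."""
    return psi(s) @ w
```

The point is purely about storage: the parameter vectors `w` and `theta` replace tables indexed by S and S × A.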
Contributions. Our main contribution is a tractable, model-free primal-dual π-
learning algorithm for such compact parametric representations. The algorithm ap-
plies to a general state space, which may be continuous and infinite. It incrementally
updates the value and policy estimates from sampled state-action transitions.
• The new algorithm is inspired by a saddle point formulation of policy optimization
in MDPs, which we refer to as the Bellman saddle point problem. We show
a strong relationship between the parametric saddle point problem and the orig-
inal Bellman equation. In particular, the difference between solutions to these
two problems can be quantified using the L∞- and L1-errors of the parametric
function classes that are used to approximate the optimal value function and
state-action density function, respectively. In the special case where the approx-
imation error is zero (which we refer to as the “realizable” scenario), solving the
parametric Bellman saddle point problem is equivalent to solving the original
Bellman equation.
• Each iteration of the algorithm can be viewed as a stochastic primal-dual iter-
ation for solving the Bellman saddle point problem, where the value and policy
updates are coupled in light of strong duality. We study the sample complexity
of the π learning algorithm by analyzing the coupled primal-dual convergence
process. We show that finding an ε-optimal policy (compared to the best
approximate policy) requires a sample size that is linear in (m + n log n)/ε², ignoring
constants. The sample complexity depends only on the numbers of state and
action features. It is invariant to the actual sizes of the state and action spaces.
Notations. The following notations are used throughout the paper. For any integer
n, we use [n] to denote the set of integers {1, 2, . . . , n}. Let (Ω, F, ζ) be an
arbitrary measure space and let f : Ω → R be a measurable function defined on (Ω, F, ζ).
Denote by ∫_Ω f dζ the Lebesgue integral of f. If the meaning of ζ is clear from the
context, we write ∫_Ω f(x) dx for the Lebesgue integral of f. If f is absolutely
integrable, we define the L1 norm of f as ‖f‖₁ = ∫_Ω |f| dζ. If f is square
integrable, we define the L2 norm of f as ‖f‖₂ = (∫_Ω f² dζ)^{1/2}. The L∞
norm of f is defined as the infimum of all quantities M ≥ 0 that satisfy
|f(x)| ≤ M for almost every x. Given two measurable functions f and g, their inner
product is defined as 〈f, g〉 = ∫_Ω f g dζ. For two probability distributions u and
w over a finite set X, we denote by DKL(u‖w) their Kullback-Leibler divergence,
i.e., DKL(u‖w) = ∑_{x∈X} u(x) log(u(x)/w(x)). For two functions f(x) and g(x), we say that
f(x) = O(g(x)) if there exists a constant C such that |f(x)| ≤ Cg(x) for all x.
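As a quick check of the Kullback-Leibler definition above, a minimal implementation for distributions over a finite set (with the standard convention that terms with u(x) = 0 contribute zero):

```python
import numpy as np

def kl_divergence(u, w):
    """D_KL(u || w) = sum_x u(x) log(u(x) / w(x)) for finite
    distributions u and w; entries with u(x) = 0 contribute 0."""
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    mask = u > 0
    return float(np.sum(u[mask] * np.log(u[mask] / w[mask])))
```

For example, `kl_divergence([1, 0], [0.5, 0.5])` equals log 2, while the divergence of any distribution from itself is zero.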
3.2 Preliminaries and Formulation of the Problem
Let (S, F) be the state space that is equipped with an appropriate measure, and
we use ∫_S f(s) ds to denote the integral over this measurable space. We let A be
a finite discrete action space, and we use ∫_A f(a) da to denote the integral over the
action space. A function p(s, a, s′), s ∈ S, a ∈ A, s′ ∈ S, is called a Markov transition
function if for each state-action pair (s, a) ∈ S × A, the function p(s, a, s′), as a function
of s′ ∈ S, is a probability density function over S with ∫_S p(s, a, s′) ds′ = 1, and for
each fixed s′ ∈ S, the function p(s, a, s′) is measurable as a function of (s, a) ∈ S × A.
Let v : S → R be a function defined on S. A Markov transition function p(s, a, s′)
defines the transition operator P that maps v to a function Pv : S × A → R:

(Pv)(s, a) = ∫_S v(s′) p(s, a, s′) ds′.    (3.2.1)
Let s′ be a state sampled from p(s, a, ·) given a state-action pair (s, a). The transition
operator has the equivalent definition
(Pv)(s, a) = Es′|s,a[v(s′)].
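On a finite state space, the operator P in (3.2.1) reduces to a tensor contraction. The sketch below (with a randomly generated toy transition function of our own construction) illustrates this.

```python
import numpy as np

# Toy finite MDP: |S| = 3 states, |A| = 2 actions; P[s, a] is the
# probability vector p(s, a, ·), generated randomly for illustration.
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)      # each p(s, a, ·) sums to 1

def apply_transition_operator(P, v):
    """(Pv)(s, a) = sum_{s'} v(s') p(s, a, s') = E[v(s') | s, a]."""
    return P @ v                        # result has shape (|S|, |A|)

v = np.array([1.0, 2.0, 3.0])
Pv = apply_transition_operator(P, v)    # each entry is a convex average
```

Since each p(s, a, ·) is a probability vector, every entry of `Pv` lies between the minimum and maximum of `v`, matching the interpretation of (Pv)(s, a) as a conditional expectation.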
We denote by M = (S, A, p, p₀, r) a Markov decision process where S, A, p are
defined above, p₀ : S → R is a bounded initial state density function, and r : S × A →
R is a reward function with r(s, a) ∈ [0, 1]. The agent starts from the initial state
drawn from p0 and takes actions sequentially. After taking action a at state s, the
agent transitions to the next state with density p(s, a, ·). During the transition, the
agent receives a reward r(s, a).
In this work, we focus on the case where the agent applies a randomized stationary
policy. A randomized stationary policy is a function π(s, a), s ∈ S, a ∈ A, where
π(s, ·) is a distribution over the action space. Denote by pπ the probability density
function of p under a fixed policy π, where

pπ(s, s′) = ∫_A π(s, a) p(s, a, s′) da,

for all s, s′ ∈ S. A Markov transition function pπ(s, s′) also defines the operator Pπᵀ
that acts on the probability density functions:

(Pπᵀ ν)(s′) = ∫_S pπ(s, s′) ν(s) ds,    (3.2.2)
where ν is a bounded probability density function defined on S. Suppose that ν is the
distribution of the agent’s current state. Then PTπ ν is the distribution of the agent’s
next state after the agent takes an action according to π.
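On a finite state space, (3.2.2) is again a simple contraction. The helper below is a hypothetical illustration of our own, not code from the chapter: it computes the next-state distribution Pπᵀν for a tabular MDP.

```python
import numpy as np

def next_state_dist(P, pi, nu):
    """Apply (3.2.2) on a finite state space: given the transition
    tensor P[s, a, s'] = p(s, a, s'), policy pi[s, a], and current
    state distribution nu[s], return (P_pi^T nu)(s')."""
    # p_pi(s, s') = sum_a pi(s, a) p(s, a, s')
    P_pi = np.einsum('sa,sat->st', pi, P)
    # (P_pi^T nu)(s') = sum_s p_pi(s, s') nu(s)
    return nu @ P_pi
```

Because each row pπ(s, ·) is a probability density, the output of `next_state_dist` is again a probability distribution whenever `nu` is one.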
3.2.1 Infinite-Horizon Average-Reward MDP
Denote by Π the space of all randomized stationary policies. The policy optimization
problem is to maximize the infinite-horizon average reward over Π:
v̄∗ = max_{π∈Π} { v̄π = lim sup_{T→∞} Eπ[ (1/T) ∑_{t=0}^{T−1} r(sₜ, aₜ) | s₀ ∼ p₀ ] },    (3.2.3)
where (s0, a0, s1, a1, . . . , st, at, . . .) are state-action transitions generated by the
Markov decision process under π from the initial distribution p0, and the expectation
Eπ[·] is taken over the entire trajectory.
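The average reward in (3.2.3) can be approximated empirically. The following sketch, a toy finite-MDP simulator of our own construction, estimates the per-step average reward of a fixed randomized stationary policy over a finite horizon T.

```python
import numpy as np

def average_reward(P, R, pi, s0, T, rng):
    """Monte Carlo estimate of (1/T) * sum_{t<T} r(s_t, a_t) for a
    finite MDP with transition tensor P[s, a, s'] and reward table
    R[s, a], run under stationary policy pi from initial state s0."""
    s, total = s0, 0.0
    for _ in range(T):
        a = rng.choice(R.shape[1], p=pi[s])    # a_t ~ pi(s_t, ·)
        total += R[s, a]
        s = rng.choice(P.shape[2], p=P[s, a])  # s_{t+1} ~ p(s_t, a_t, ·)
    return total / T
```

For large T this trajectory average approaches v̄π under the ergodicity assumptions discussed next; the limit superior in (3.2.3) handles the general case where the limit need not exist.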
Under certain continuity and ergodicity assumptions [90, 56], the optimal average
reward v̄∗ is independent of the initial distribution p0, and the policy optimization
problem (3.2.3) is equivalent to the following optimization problem
v̄∗ = max_{π∈Π, ξ:S→R} ∫_S ξ(s) ∫_A π(s, a) r(s, a) da ds

s.t.  ξ(s′) = ∫_S pπ(s, s′) ξ(s) ds, ∀s′,

      ξ ≥ 0, ∫_S ξ(s) ds = 1,    (3.2.4)
where the constraint ensures that ξ is the stationary density function of states under
the policy π. Let µ(s, a) = ξ(s)π(s, a). Then policy optimization problem (3.2.4)
becomes a (possibly infinite dimensional) linear program
v̄∗ = max_{µ:S×A→R} ∫_S ∫_A µ(s, a) r(s, a) da ds

s.t.  ∫_A µ(s′, a) da = ∫_S ∫_A p(s, a, s′) µ(s, a) da ds, ∀s′,

      µ ≥ 0, ∫_S ∫_A µ(s, a) da ds = 1,    (3.2.5)
where the constraint ensures that µ is a stationary state-action density of the MDP
under some policy. We denote by µ∗ the optim