On the Complexity of Markov Decision
Problems
Yichen Chen
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Computer Science
Adviser: Professor Mengdi Wang
June 2020
© Copyright by Yichen Chen, 2020.
All rights reserved.
Abstract
The Markov decision problem (MDP) is one of the most basic models for sequential
decision-making problems in a dynamic environment where outcomes are partly ran-
dom. It models a stochastic control process in which a planner makes a sequence
of decisions as the system evolves. The Markov decision problem provides a mathe-
matical framework for dynamic programming, stochastic control, and reinforcement
learning. In this thesis, we study the complexity of solving MDPs.
In the first part of the thesis, we propose a class of stochastic primal-dual methods
for solving MDPs. We formulate the core policy optimization problem of the MDP
as a stochastic saddle point problem. By utilizing the value-policy duality structure,
the algorithm samples state-action transitions and makes alternating updates to the
value and policy estimates. We prove that our algorithm finds approximately
optimal policies for Markov decision problems with small space and computational
complexity. Using linear models to represent the value functions and the policies, our
algorithm is capable of scaling to problems with infinite and continuous state spaces.
In the second part of the thesis, we establish the computational complexity lower
bounds for solving MDPs. We prove our results by modeling the MDP algorithms
using branching programs and then characterizing the properties of these programs
by quantum arguments. The analysis is also extended to the study of the complexity
of solving two-player turn-based Markov games. Our results show that if we have a
simulator that can sample according to the transition probability function in O(1) time,
the lower bounds have reduced dependence on the state number. These results sug-
gest a fundamental difference between Markov decision problems with and without a
simulator.
We believe that our results provide a new piece of theoretical evidence for the
success of simulation-based methods in solving MDPs and Markov games.
Acknowledgements
I would like to give my deepest gratitude to my advisor, Professor Mengdi Wang, for
her guidance, patience, and inspiration in the past five years. It is she who introduced
me to the world of dynamic programming. From then on, she has been a continuing
source of wisdom and support. I have benefited tremendously from her taste in research,
meticulous attention to detail, and keen insights into solving research problems. I
am forever indebted to her for investing so much time in training me as a researcher,
reading over my papers, and improving my presentation skills. Beyond research, she
encouraged me to break out of my comfort zone, gave me plenty of opportunities to
attend conferences, and introduced me to world-class scientists in the field. Working
with her has been one of the most amazing experiences of my life.
I would like to thank Professor Sanjeev Arora and Professor Robert Schapire
– two leading figures in the research of theoretical computer science and machine
learning – for being on my thesis committee. Their commitment to research is truly
inspiring and motivates me to keep exploring the unknown. It has been a great honor
for me to interact with them over the years at Princeton. I am also grateful to
Professor Yuxin Chen and Professor Karthik Narasimhan for the valuable suggestions
and comments they have made regarding my research.
My special thanks go to my collaborator Lihong Li, who was also my mentor during
my internship at Google. I appreciated the discussions with him about the interesting
problems we worked on together. I am grateful for his invaluable advice regarding research,
presentation, writing, and more. He is always ready to help whenever I encounter
any difficulties in research or life. I also want to thank my host Jingtao Wang,
who made my internship at Google such an enjoyable journey.
I had a great time working at Princeton. I would like to thank the (past and
present) members of my research group: Saeed Ghadimi, Xudong Li, Lin Yang, Jian
Ge, Galen Cho, Hao Lu, Yifan Sun, Yaqi Duan, Hao Gong, Zheng Yu. I would like to
thank my other friends in Princeton: Weichen Wang, Junwei Lu, Levon Avanesyan,
Zongxi Li, Suqi Liu, Cong Ma, Kaizheng Wang, Xiaozhou Li, Nanxi Kang, Xin Jin,
Xinyi Fan, Linpeng Tang, Haoyu Zhang, Kelvin Zou, Yixin Sun, Yinda Zhang, Jun
Su, Zhengyu Song and Yixin Tao. Thank you all for making this journey enjoyable.
Finally, I would like to thank my parents Tianlei Chen and Zhixia Tao for their
unconditional love and support throughout my life. I would also like to thank my
beloved partner, Yiqin Shen, for her love, inspiration, and company. I dedicate this
thesis to them.
To my family.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction 1
2 Stochastic Primal-Dual Methods for Markov Decision Problems 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Preliminaries and Formulation of the Problem . . . . . . . . . . . . . 10
2.2.1 Finite-State Discounted-Reward MDP . . . . . . . . . . . . . 10
2.2.2 Finite-State Finite-Horizon MDP . . . . . . . . . . . . . . . . 11
2.3 Value-Policy Duality of Markov Decision Processes . . . . . . . . . . 14
2.3.1 Finite-State Discounted-Reward MDP . . . . . . . . . . . . . 14
2.3.2 Finite-State Finite-Horizon MDP . . . . . . . . . . . . . . . . 18
2.4 Stochastic Primal-Dual Methods for Markov Decision Problems . . . 22
2.5 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.1 Sample Complexity Analysis of Discounted-Reward MDPs . . 27
2.5.2 Sample Complexity Analysis of Finite-Horizon MDPs . . . . . 30
2.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3 Primal-Dual π Learning Using State and Action Features 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Preliminaries and Formulation of the Problem . . . . . . . . . . . . . 40
3.2.1 Infinite-Horizon Average-Reward MDP . . . . . . . . . . . . . 41
3.2.2 Infinite-Horizon Discounted-Reward MDP . . . . . . . . . . . 44
3.3 Model Reduction of MDP using State and Action Features . . . . . . 45
3.3.1 Using State and Action Features As Bases . . . . . . . . . . . 46
3.3.2 Reduced-Order Bellman Saddle Point Problem for Average-
Reward MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.3 Reduced-Order Bellman Saddle Point Problem for Discounted-
Reward MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Primal-Dual π Learning for Average-Reward MDPs . . . . . . . . . . 50
3.4.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.2 Sample Complexity Analysis . . . . . . . . . . . . . . . . . . . 52
3.5 Primal-Dual π Learning for Discounted-Reward MDPs . . . . . . . . 63
3.5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.5.2 Sample Complexity Analysis . . . . . . . . . . . . . . . . . . . 64
3.6 Related Literatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4 Complexity Lower Bounds of Discounted-Reward MDPs 76
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Main Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 Family of Hard Instances of MDP . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Hard Instances of Standard MDP . . . . . . . . . . . . . . . . . 84
4.4.2 Hard Instances of CDP MDP and Binary Tree MDP . . . . . . . . 85
4.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.1 A Sub-Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.2 Proof of Theorem 4.3.1 . . . . . . . . . . . . . . . . . . . . . 88
4.5.3 Proofs of Theorem 4.3.2 and 4.3.3 . . . . . . . . . . . . . . . . 92
4.5.4 Proof of Lemma 4.5.1 . . . . . . . . . . . . . . . . . . . . . . . 93
5 Complexity Lower Bounds of Markov Games 97
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 Main Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4 Hard Instances of Markov Games . . . . . . . . . . . . . . . . . . . . 107
5.4.1 Hard Instances of Array Markov Game . . . . . . . . . . . . . 107
5.4.2 Hard Instances of CDP Markov Game and Binary Tree Markov
Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.5.1 Computational Model . . . . . . . . . . . . . . . . . . . . . . 113
5.5.2 Proof of Theorem 5.3.1 . . . . . . . . . . . . . . . . . . . . . . 114
5.5.3 Proof of Theorem 5.3.2 . . . . . . . . . . . . . . . . . . . . . . 118
5.6 Extension to Markov Games with Irreducibility Property . . . . . . . 120
A Proofs in Chapter 2 122
A.1 Analysis of the SPD-dMDP Algorithm 2.1 . . . . . . . . . . . . . . . 122
A.1.1 Preliminary Lemmas . . . . . . . . . . . . . . . . . . . . . . . 123
A.1.2 Proof of Theorem 2.5.1 . . . . . . . . . . . . . . . . . . . . . . 130
A.1.3 Proof of Theorem 2.5.2 . . . . . . . . . . . . . . . . . . . . . . 131
A.1.4 Proof of Theorem 2.5.3 . . . . . . . . . . . . . . . . . . . . . . 133
A.2 Analysis of the SPD-fMDP Algorithm 2.2 . . . . . . . . . . . . . . . 134
A.2.1 Preliminary Lemmas . . . . . . . . . . . . . . . . . . . . . . . 134
A.2.2 Proof of Theorem 2.5.4 . . . . . . . . . . . . . . . . . . . . . . 145
A.2.3 Proof of Theorem 2.5.5 . . . . . . . . . . . . . . . . . . . . . . 146
A.2.4 Proof of Theorem 2.5.6 . . . . . . . . . . . . . . . . . . . . . . 148
B Proofs in Chapter 3 150
B.1 Proof of Lemmas 3.4.1 and 3.4.2 . . . . . . . . . . . . . . . . . . . . . 150
B.2 Proof of Lemmas 3.5.1 and 3.5.2 . . . . . . . . . . . . . . . . . . . . . 155
C Proofs in Chapter 5 159
C.1 Supporting Lemmas for Theorem 5.3.1 . . . . . . . . . . . . . . . . . 159
C.1.1 Proof of Lemma 5.5.1 . . . . . . . . . . . . . . . . . . . . . . . 159
C.1.2 Proof of Lemma 5.5.2 . . . . . . . . . . . . . . . . . . . . . . . 163
C.2 Supporting Lemmas for Theorem 5.3.2 . . . . . . . . . . . . . . . . . 165
C.2.1 Proof of Lemma 5.5.3 . . . . . . . . . . . . . . . . . . . . . . . 165
C.2.2 Proof of Lemma 5.5.4 . . . . . . . . . . . . . . . . . . . . . . 167
Bibliography 168
List of Figures
4.1 Input of Standard MDP: Arrays of Transition Probabilities for M1 ∈ M1
and M2 ∈ M2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2 Input of CDP MDP: Arrays of Cumulative Probabilities for M3 ∈ M3
and M4 ∈ M4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3 Input of Binary Tree MDP: (a) A snippet of the binary tree of M3 ∈
M3, where all transitions are to bad states. (b) A snippet of the
binary tree of M4 ∈ M4, where there is some (s, a) which transitions
to a good state sG,2 with probability ε. . . . . . . . . . . . . . . . . . 87
5.1 A hard instance of MDP: In case of Type I states, the state transitions
to a random bad state s′ ∈ SB with probability 1 for every action a ∈ AN .
In case of Type II states, the state transitions to a rewarding state
s̄ ∈ SG with probability 1 under some action ā ∈ AN . . . . . . . . . . 107
5.2 A hard instance of CDP Markov Game and Binary Tree Markov Game: In
the case of Type III states, the state transitions to the first 1/ε states each
with probability ε for every action a ∈ AN . In the case of Type IV states, the
state transitions to a rewarding state s̄ ∈ SG with probability ε under
some action ā ∈ AN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Chapter 1
Introduction
Reinforcement learning lies at the intersection between control, machine learning,
and stochastic processes. Recent empirical successes demonstrate that reinforcement
learning, combined with sophisticated function approximation techniques (e.g., deep
neural networks), can conquer even the most challenging tasks in video and board
games [81, 97]. These successes have drawn attention to the research of the Markov
decision process, which provides a mathematical framework for dynamic program-
ming, stochastic control, and reinforcement learning. The Markov decision process
is a controlled random walk over a state space S, where in each state s ∈ S, one
can choose an action a from an action space A so that the random walk transitions
to another state s′ ∈ S with probability P (s, a, s′) and yields an immediate reward r(s, a).
The key goal is to identify an optimal policy that, when run on the Markov
decision process, maximizes some cumulative function of rewards. We use the expres-
sion Markov decision problem for a Markov decision process together with such an
optimality criterion [90].
Researchers have been developing methods for solving Markov decision problems
for decades. There are three major approaches for solving the MDP: the value iter-
ation method, the policy iteration method, and the linear programming method. In
1957, Bellman [8] developed an iteration algorithm, called value iteration, to compute
the optimal total reward function, which is guaranteed to converge in the limit. In
policy iteration [58], the algorithm alternates between a value determination phase,
in which the current policy is evaluated, and a policy improvement phase, in which
the policy is updated according to the evaluation. Around the same time, D’Epenoux
[39] and de Ghellinck [38] discovered that the MDP has an LP formulation, allowing
it to be solved by general LP methods such as the simplex method [35], the Ellipsoid
method [60] or the interior-point algorithm [59].
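As a concrete aside, the value iteration method mentioned above is short enough to sketch in code. The Python snippet below is an illustrative implementation on a made-up two-state, two-action MDP; the array layout (P[a, i, j] for P_a(i, j) and r[a, i] for the expected one-step reward of action a in state i) is a choice of this sketch, not a convention used elsewhere in the thesis.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Bellman's value iteration for a finite MDP.

    P: transition tensor of shape (|A|, |S|, |S|), P[a, i, j] = P_a(i, j)
    r: expected reward array of shape (|A|, |S|)
    Returns the optimal value vector and a greedy optimal policy.
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # Q[a, i] = r_a(i) + gamma * sum_j P_a(i, j) v(j)
        Q = r + gamma * (P @ v)
        v_new = Q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, Q.argmax(axis=0)
        v = v_new

# A toy 2-state, 2-action MDP; the numbers are invented for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v_star, pi_star = value_iteration(P, r, gamma=0.9)
```

Because the Bellman operator is a γ-contraction, the loop converges geometrically; this is the "guaranteed to converge in the limit" property noted above.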
As the notion of computational complexity emerged, there were tremendous efforts
in analyzing the complexity of MDPs and methods for solving them. All three of
the methods mentioned above have been proved to solve MDPs in polynomial
time [104, 71, 116, 68]. Here, polynomial time means that the number of arithmetic
operations needed to compute an exact optimal policy is bounded by a polynomial in
the number of states, the number of actions, and the bit-size of the data. However,
although they run in polynomial time, all of these methods have a computational
complexity that is superlinear in the input size Θ(|S|²|A|). This is mainly due to the
nature of the goal of finding the exact optimal policy: it requires full knowledge of the
system, which suffers from the curse of dimensionality.
In this thesis, we study the problem of finding the approximately optimal policies
for MDPs: is there a way to trade the precision of the exact optimal policy for a
better time complexity? In particular, we are interested in reducing the complexity’s
dependence on the size of the state space and the size of the action space. To serve this
purpose, we develop a class of stochastic primal-dual methods for solving MDPs.
We formulate the core policy optimization problem of the MDP as a stochastic saddle
point problem. By utilizing the value-policy duality structure of MDPs and leveraging
the powerful machinery from optimization, our primal-dual algorithms can find the
approximately optimal policies with small space and computational complexity. By
adopting linear models to represent the high-dimensional value function and state-
action density functions, our algorithm is capable of scaling to problems with infinite
and continuous state spaces.
Meanwhile, we are interested in the computational complexity lower bounds for
solving MDPs. Here, the computational complexity lower bounds mean the minimum
number of arithmetic operations or queries to the input data, as a function of |S| and
|A|, that is required for any algorithm to solve the MDP with high probability. In
contrast to the large volumes of upper bound results, there are fewer results on the
complexity lower bounds [87, 52]. In this thesis, we establish the first computational
complexity lower bounds for solving MDPs. The analysis is also extended to the
study of the computational complexity for two-player turn-based Markov games. Our
results show that if we have a simulator that can sample according to the transition
probability function in O(1) time, the lower bounds have reduced dependence on the state
number. These results suggest a fundamental difference between problems with and
without a simulator, which explains the success of the simulation-based approaches
for Markov decision problems in recent years [97].
Here we give an overview of each chapter in the thesis. Chapter 2 and Chapter 3
study the complexity of MDPs from the positive side, developing new algorithms with
provable convergence rate guarantees. Chapter 4 and Chapter 5 study the complexity
from the negative side, establishing the first computational complexity lower bounds
for solving MDPs and Markov games.
Stochastic primal-dual methods for Markov decision problems (Chapter 2
[109, 23])
We study the online estimation of the optimal policy of a Markov decision process
(MDP). Central to the problem is to learn the optimal policy and/or the value function
of the system. In the empirically successful actor-critic algorithm [62], the searching
procedure updates the value function and the policy simultaneously. In essence, it uses
the value estimates to update the policy more accurately and uses the policy estimate
to update the value function with less variance. In Chapter 2, we justify this procedure
using the optimization theory. We show the dual relationship between the value
function and the policy function in a saddle point formulation of the Markov decision
process. The interpretation demonstrates the primal-dual nature of the alternating
update methods. Based on this insight, we propose a class of Stochastic Primal-Dual
(SPD) methods that exploit the inherent minimax duality of Bellman equations. The
SPD methods update a few coordinates of the value and policy estimates as a new
state transition is observed. We prove the convergence rate of these algorithms and
show that they are both space-efficient and computationally efficient.
Primal-dual π learning using state and action features (Chapter 3 [22])
Approximate linear programming (ALP) represents one of the major algorithmic fam-
ilies to solve large-scale Markov decision processes (MDPs). In this chapter, we study
a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm
called primal-dual π learning for reinforcement learning when a sampling oracle is
provided. This algorithm enjoys several advantages. First, it adopts linear models to
represent the high-dimensional value function and state-action density functions, re-
spectively, using given state and action features. Its run-time complexity depends on
the number of features, not the size of the underlying MDPs. Second, our algorithm
and analysis apply to an arbitrary state space, even if the state space is continuous
and infinite. Third, it operates in a fully online fashion without having to store any
sample, thus having minimal memory footprint. Fourth, we prove that it is sample-
efficient, solving for the optimal policy to high precision with a sample complexity
linear in the dimension of the parameter space.
Complexity lower bounds of discounted-reward MDPs (Chapter 4)
We study the computational complexity of the infinite-horizon discounted-reward
Markov Decision Problem (MDP) with a finite state space S and a finite action space
A. We show that any randomized algorithm needs a running time of at least Ω(|S|²|A|)
to compute an ε-optimal policy with high probability. We consider two variants of
the MDP where the input is given in specific data structures, including arrays of
cumulative probabilities and binary trees of transition probabilities. For these cases,
we show that the complexity lower bound reduces to Ω(|S||A|/ε). These results reveal
a surprising observation: the computational complexity of the MDP depends on
the data structure of the input.
Complexity lower bounds of Markov games (Chapter 5)
Recent years have seen the huge success of reinforcement learning in strategic games
for artificial intelligence [81, 97]. The reinforcement learning agent learns to play
optimal actions from games of self-play. In game theory, such systems are modeled
by stochastic games, which have been studied for decades [95]. Surprisingly, it is
only in recent years that 2-player turn-based stochastic games have been proved to
be solvable in strongly polynomial time [53]. In Chapter 5, we study the compu-
tational complexity lower bounds for solving 2-player turn-based stochastic games.
Our results show that the worst-case computational complexity lower bounds depend
on the data structure used in representing the stochastic game. When the transition
probabilities of the game are given in the form of arrays, the lower bounds have a
linear dependence on the number of states and actions. When the probabilities are
given in the format that allows efficient simulation of the game, the lower bounds
have a sublinear dependence on the number of states and actions. Our results sug-
gest that the efficient simulator used in reinforcement learning methods might be a
contributing factor to its empirical success.
Chapter 2
Stochastic Primal-Dual Methods
for Markov Decision Problems
2.1 Introduction
The Markov decision process (MDP) is one of the most basic models of dynamic pro-
gramming, stochastic control and reinforcement learning; see the textbook references
[10, 13, 100, 90]. Given a controllable Markov chain and the distribution of state-
to-state transition rewards, the aim is to find the optimal action to perform at each
state in order to maximize the expected overall reward. MDP and its numerous vari-
ants are widely applied in engineering systems, artificial intelligence, e-commerce and
finance. Classical solvers of MDP require full knowledge of the underlying stochastic
process and reward distribution, which are often not available in practice.
In this chapter, we study both the infinite-horizon discounted-reward MDP and
the finite-horizon MDP. In both cases, we assume that the MDP has a finite state
space S and a finite action space A. We focus on the model-free learning setting
where both transition probabilities and transitional rewards are unknown. Instead,
a simulation oracle is available to generate random state-to-state transitions and
transitional rewards. The simulation oracle is able to model offline retrieval of static
empirical data as well as live interaction with real-time simulation systems. The
algorithmic goal is to estimate the optimal policy of the unknown MDP based on
empirical state transitions, without any prior knowledge or restrictive assumption
about the underlying process. In the literature of approximate dynamic programming
and reinforcement learning, many methods have been developed, and some of them
are proved to achieve near-optimal performance guarantees in certain senses; recent
examples include [85, 34, 67, 66, 109]. Although researchers have made significant
progress in developing reinforcement learning methods, it remains unclear whether
there is an approach that achieves both theoretical optimality and practical scalability.
This is an active area of research.
In this chapter, we present a novel approach motivated by the linear programming
formulation of the nonlinear Bellman equation. We formulate the Bellman equation
into a stochastic saddle point problem, where the optimal primal and dual solutions
correspond to the optimal value and policy functions, respectively. We propose a class
of Stochastic Primal-Dual algorithms (SPD) for the discounted MDP and the finite-
horizon MDP. Each iteration of the algorithms updates the primal and dual solutions
simultaneously using noisy partial derivatives of the Lagrangian function. We show
that one can compute a noisy partial derivative efficiently from a single observation of
the state transition. The SPD methods are stochastic analogs of the primal-dual iter-
ation for linear programming. They also involve alternating projections onto specially
constructed sets. The SPD methods are straightforward to implement and exhibit
favorable space complexity. To analyze their sample complexity, we adopt the notion
of “Probably Approximately Correct” (PAC), which means achieving an ε-optimal
policy with high probability using a sample size polynomial in the problem parameters.
The main contributions of this chapter are fourfold:
1. We study the basic linear algebra of reinforcement learning. We show that
the optimal value and optimal policy are dual to each other, and they are
the solutions to a stochastic saddle point problem. The value-policy duality
implies a convenient algebraic structure that may facilitate efficient learning
and dimension reduction.
2. We develop a class of stochastic primal-dual (SPD) methods that maintain a
value estimate and a policy estimate and update their coordinates while pro-
cessing state-transition data incrementally. The SPD methods exhibit superior
space and computational scalability. They require O(|S| × |A|) space for dis-
counted MDP and O(|S| × |A| × H) space for finite-horizon MDP. The space
complexity of SPD is sublinear in the input size of the MDP model. For dis-
counted MDP, each iteration updates two coordinates of the value estimate and
a single coordinate of the policy estimate. For finite-horizon MDP, each iter-
ation updates 2H coordinates of the value estimate and H coordinates of the
policy estimate.
3. For discounted MDP, we develop the SPD-dMDP Algorithm 2.1. It yields
an ε-optimal policy with probability at least 1 − δ using the following sample
size/iteration number:

O( (|S|⁴|A|²σ² / ((1 − γ)⁶ε²)) · ln(1/δ) ),

where γ ∈ (0, 1) is the discount factor, |S| and |A| are the sizes of the state space
and the action space, and σ is a uniform upper bound on the state-transition rewards.
We obtain the sample complexity results by analyzing the duality gap sequence
and applying the Bernstein inequality to a specially constructed martingale.
The analysis is novel to the best of the authors' knowledge.
4. For finite-horizon MDP, we develop the SPD-fMDP Algorithm 2.2. It yields
an ε-optimal policy with probability at least 1 − δ using the following sample
size/iteration number:

O( (|S|⁴|A|²H⁶σ² / ε²) · ln(1/δ) ),
where H is the total number of periods. The key aspect of the finite-horizon
algorithm is to adapt the learning rate/stepsize for updates on different periods.
In particular, the algorithm has to update the policies associated with the earlier
periods more aggressively than those associated with the later periods.
The SPD is a model-free method and applies to a wide class of dynamic programming
problems. Within the scope of this chapter, the sample transitions are drawn from a
static distribution. We conjecture that the sample complexity results can be improved
by allowing exploitation, i.e., adaptive sampling of actions. The results of this chapter
suggest that the linear duality of MDP bears convenient structures yet to be fully
exploited.
Chapter Organization Section 2.2 reviews the basics of discounted and finite-
horizon MDP and related works in this area. Section 2.3 studies the duality between
optimal values and policies. Section 2.4 presents the SPD-dMDP and SPD-fMDP
algorithms and discusses their implementation and complexities. Section 2.5 presents
the main results; the proofs are deferred to the appendix.
Notations All vectors are considered as column vectors. For a vector x ∈ R^n, we
denote by x^T its transpose, and denote by ‖x‖ = √(x^T x) its Euclidean norm. For a
matrix A ∈ R^{n×n}, we denote by ‖A‖ = max{‖Ax‖ : ‖x‖ = 1} its induced Euclidean
norm. For a set X ⊂ R^n and a vector y ∈ R^n, we denote by
Π_X{y} = argmin_{x∈X} ‖y − x‖² the Euclidean projection of y onto X, where the
minimum is always uniquely attained if X is nonempty, convex, and closed. We denote
by e = (1, . . . , 1)^T the vector with all entries equal to 1, and by
e_i = (0, . . . , 0, 1, 0, . . . , 0)^T the vector whose i-th entry equals 1 and all other
entries equal 0. For a set X, we denote its cardinality by |X|.
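Since the Euclidean projection Π_X recurs throughout the algorithms of this chapter, a brief illustration may help. The Python sketch below computes the projection onto two simple convex sets, a box and the probability simplex; the simplex routine follows the standard sort-based algorithm, and the function names and test vectors are ours for illustration.

```python
import numpy as np

def project_box(y, lo, hi):
    """Euclidean projection of y onto the box {x : lo <= x <= hi}."""
    return np.clip(y, lo, hi)

def project_simplex(y):
    """Euclidean projection of y onto the probability simplex
    {x : x >= 0, sum(x) = 1}, via the standard sort-based algorithm."""
    u = np.sort(y)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]   # largest feasible index
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(y + theta, 0.0)

x = project_simplex(np.array([0.5, 1.2, -0.3]))   # a point in the simplex
```

Both projections run in O(n log n) or better, which is what makes projection-based primal-dual updates cheap per iteration.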
2.2 Preliminaries and Formulation of the Problem
In this section, we review the basic models of Markov decision processes.
2.2.1 Finite-State Discounted-Reward MDP
We consider a discounted MDP described by a tuple M = (S, A, P, r, γ), where S
is a finite state space, A is a finite action space, and γ ∈ (0, 1) is a discount factor. If
action a is selected while the system is in state i, the system transitions to state j
with probability P_a(i, j) and incurs a random reward r̂_{ija} ∈ [0, σ] with expectation
r_{ija}.
Let π : S ↦ A be a policy that maps a state i ∈ S to an action π(i) ∈ A. Consider
the Markov chain under policy π. We denote its transition probability matrix by P_π
and its expected transitional reward vector by r_π, i.e.,

P_π(i, j) = P_{π(i)}(i, j),    r_π(i) = Σ_{j∈S} P_{π(i)}(i, j) r_{ij π(i)},    i, j ∈ S.
The objective is to find an optimal policy π* : S ↦ A such that the infinite-horizon
discounted reward is maximized, regardless of the initial state:

max_{π : S ↦ A} E_π [ Σ_{k=0}^∞ γ^k r̂_{i_k i_{k+1} π(i_k)} ],

where γ ∈ (0, 1) is the discount factor, (i_0, i_1, . . .) are the state transitions
generated by the Markov chain under policy π, and the expectation is taken over the
entire process. We assume throughout that there exists a unique optimal policy π* to
the MDP tuple M = (S, A, P, r, γ). In other words, there exists one optimal action
for each state.
We review the standard definitions of value functions.

Definition 2.2.1. The value vector v^π ∈ R^|S| of a fixed policy π is defined as

v^π(i) = E_π [ Σ_{k=0}^∞ γ^k r̂_{i_k i_{k+1} π(i_k)} | i_0 = i ],    i ∈ S.

Definition 2.2.2. The optimal value vector v* ∈ R^|S| is defined as

v*(i) = max_{π : S ↦ A} E_π [ Σ_{k=0}^∞ γ^k r̂_{i_k i_{k+1} π(i_k)} | i_0 = i ],    i ∈ S.

For the sample complexity analysis of the proposed algorithm, we need a notion
of sub-optimality of policies. We give its definition below.
Definition 2.2.3. We say that a policy π is absolute-ε-optimal if

max_{i∈S} |v^π(i) − v*(i)| ≤ ε.

If a policy is absolute-ε-optimal, it achieves an ε-optimal reward regardless of the
initial state distribution. We note that absolute-ε-optimality is one of the strongest
notions of sub-optimality for policies. In comparison, some literature analyzes the
expected sub-optimality of a policy when the initial state i follows a certain distribution.
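As an illustration of Definitions 2.2.1 and 2.2.3: for a fixed policy π on a finite state space, the value vector satisfies the linear Bellman equation v^π = r_π + γ P_π v^π, so it can be computed exactly as v^π = (I − γ P_π)^{-1} r_π. The Python sketch below does this for an invented two-state chain; the function names and numbers are illustrative only.

```python
import numpy as np

def policy_value(P_pi, r_pi, gamma):
    """Solve the Bellman equation v = r_pi + gamma * P_pi @ v for a fixed policy.

    P_pi: row-stochastic |S| x |S| matrix induced by the policy
    r_pi: expected one-step reward vector under the policy
    """
    n = P_pi.shape[0]
    # I - gamma * P_pi is invertible since gamma < 1 and P_pi is stochastic.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def is_absolute_eps_optimal(v_pi, v_star, eps):
    """Definition 2.2.3: max_i |v_pi(i) - v*(i)| <= eps."""
    return np.max(np.abs(v_pi - v_star)) <= eps

# Illustrative two-state chain induced by some fixed policy.
P_pi = np.array([[0.9, 0.1],
                 [0.3, 0.7]])
r_pi = np.array([1.0, 0.5])
v_pi = policy_value(P_pi, r_pi, gamma=0.9)
```

The direct solve costs O(|S|³), which is exactly the kind of per-step cost the stochastic methods of this chapter are designed to avoid.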
2.2.2 Finite-State Finite-Horizon MDP
We also consider the finite-horizon Markov decision process, which can be formulated
as a tuple M = (S, A, H, P, r), where S is a finite state space with transition
probabilities encoded by P = (P_a)_{a∈A} ∈ R^{|S|×|S|×|A|}, A is a finite action space,
and H is the horizon. If action a is selected while the system is in state i ∈ S at period
h = 0, . . . , H − 1, the system transitions to state j at period h + 1 with probability
P_a(i, j) and incurs a random reward r̂_{ija} with expectation r_{ija}.
of presentation, we assume that the reward we receive does not depend on the time
period. Our algorithm can be readily extended to the case when the reward varies
with the time period. We assume that both P and r are unknown but they can be
estimated by sampling.
We augment the state space with the time period to obtain an augmented Markov
chain. Now we have a replica of S at each period h, denoted by Sh. Let S[H]
be the state space of the augmented MDP where [H] denotes the set of integers
{0, 1, . . . , H − 1}. If we select action a in state (i, h) ∈ S[H], the state transitions
to a new state (j, h + 1) ∈ S[H] with probability Pa(i, j). The transition incurs a
random reward r̂ija ∈ [0, σ] with expectation rija. At period H − 1, the state i will
transition to the terminal state with reward r̂ija. In the rest of the chapter, we use
Πa(i′, j′) to denote the transition probability of the augmented Markov chain where
i′, j′ ∈ {(i, h)|i ∈ S, h ∈ [H]}.
Let π = (π0, . . . , πH−1) be a sequence of one-step policies such that πh maps a
state i ∈ S to an action πh(i) ∈ A in the h-th period. Consider the augmented
Markov chain under policy $\pi$. We denote its transition probability matrix by $\Pi_\pi$,
where $\Pi_\pi\big((i,h),(j,h+1)\big) = P_{\pi_h(i)}(i,j)$ for all $h \in [H-1]$, $i, j \in S$, which is given by
$$\Pi_\pi = \begin{pmatrix} 0 & P_{\pi_0} & 0 & \cdots & 0 \\ 0 & 0 & P_{\pi_1} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & P_{\pi_{H-2}} \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}.$$
We denote by $r_a \in \mathbb{R}^{|S|}$ the expected state transition reward under action $a$, such that
$$r_a(i) = \sum_{j \in S} P_a(i,j)\, r_{ija}, \quad i \in S.$$
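The block structure of $\Pi_\pi$ can be checked numerically. Below is a minimal NumPy sketch that assembles the augmented transition matrix from per-action transition matrices and a per-period policy; the function and variable names are illustrative, not from the text.

```python
import numpy as np

def augmented_transition(P, policy):
    """Build the (|S|H x |S|H) matrix Pi_pi of the augmented chain.

    P[a] is the |S| x |S| transition matrix of action a, and policy[h]
    maps each state to an action in period h. The last block row is
    zero: period H-1 transitions into the absorbing terminal state.
    """
    H, n = len(policy), P[0].shape[0]
    Pi = np.zeros((n * H, n * H))
    for h in range(H - 1):
        # Row block h, column block h+1 holds P_{pi_h}: row i of the
        # block is row i of the matrix for the action policy[h][i].
        P_h = np.array([P[policy[h][i]][i] for i in range(n)])
        Pi[h * n:(h + 1) * n, (h + 1) * n:(h + 2) * n] = P_h
    return Pi
```

Only the superdiagonal block band is nonzero, which is what makes $(I - \Pi_{\pi}^T)$ triangular with unit determinant in Theorem 2.3.2.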
We review the standard definitions of value functions for finite-horizon MDP.
Definition 2.2.4. The $h$-period value function $v^\pi_h : S_h \mapsto \mathbb{R}$ under policy $\pi$ is defined as
$$v^\pi_h(i) = \mathbb{E}^\pi\left[\sum_{\tau=h}^{H-1} \hat{r}_{i_\tau i_{\tau+1} \pi_\tau(i_\tau)} \,\Big|\, i_h = i\right], \quad \forall\, h \in [H],\ i \in S.$$

The random variables $(i_h, i_{h+1}, \ldots)$ are state transitions generated by the Markov
chain under policy $\pi$ starting from state $i$ and period $h$, and the expectation $\mathbb{E}^\pi[\cdot]$
is taken over the entire process. We denote the overall value function by $v = (v_0^T, \ldots, v_{H-1}^T)^T \in \mathbb{R}^{|S|H}$.
The objective of the finite-horizon MDP is to find an optimal policy π∗ : S[H] 7→ A
such that the finite-horizon reward is maximized, regardless of the starting state.
Based on the optimal policy, we define the optimal value function as follows.
Definition 2.2.5. The optimal value vector $v^* = (v^*_h)_{h \in [H]} \in \mathbb{R}^{H \times |S|}$ is defined as
$$v^*_h(i) = \max_{\pi: S_{[H]} \mapsto A} \mathbb{E}^\pi\left[\sum_{\tau=h}^{H-1} \hat{r}_{i_\tau i_{\tau+1} \pi_\tau(i_\tau)} \,\Big|\, i_h = i\right] = \mathbb{E}^{\pi^*}\left[\sum_{\tau=h}^{H-1} \hat{r}_{i_\tau i_{\tau+1} \pi^*_\tau(i_\tau)} \,\Big|\, i_h = i\right],$$
for all $h \in [H]$, $i \in S$.
In order to analyze the sample complexity, we use the following notion of absolute-$\epsilon$-optimality for finite-horizon MDP.
Definition 2.2.6. We say that a policy $\pi$ is absolute-$\epsilon$-optimal if
$$\max_{h \in [H],\, i \in S} |v^\pi_h(i) - v^*_h(i)| \le \epsilon.$$
Note that if a policy is absolute-$\epsilon$-optimal, it achieves an $\epsilon$-optimal cumulative
reward from all states and all intermediate periods. This is one of the strongest
notions of sub-optimality for finite-horizon policies.
2.3 Value-Policy Duality of Markov Decision Processes
In this section, we study the Bellman equation of the Markov decision process from
the perspective of linear duality. We show that the optimal value and policy functions
are dual to each other and they are the solutions to a special saddle point problem.
We analyze the value-policy duality for the infinite-horizon discounted-reward case
and the finite-horizon case separately.
2.3.1 Finite-State Discounted-Reward MDP
Consider the discounted MDP described by the tuple $M = (S, A, P, r, \gamma)$ as in Section 2.2.1. According to the theory of dynamic programming [10], a vector $v^*$ is the optimal value function of the MDP if and only if it satisfies the following system of $|S|$ equations in $|S|$ unknowns, known as the Bellman equation:
$$v^*(i) = \max_{a \in A}\left\{\gamma \sum_{j \in S} P_a(i,j)\, v^*(j) + \sum_{j \in S} P_a(i,j)\, r_{ija}\right\}, \quad i \in S. \tag{2.3.1}$$
When $\gamma \in (0,1)$, the Bellman equation has a unique fixed point solution $v^*$, and it
equals the optimal value function of the MDP. Moreover, a policy $\pi^*$ is an optimal
policy for the MDP if and only if it attains the maximization in the Bellman equation.
Note that this is a nonlinear system of fixed point equations. Interestingly, the Bellman
equation (2.3.1) is equivalent to the following linear programming (LP) problem with $|S|$ variables and $|S| \cdot |A|$ constraints (see Puterman, 2014 [90], Section 6.9, and the paper by de Farias and
Van Roy [36]):
$$\begin{array}{ll} \text{minimize} & \xi^T v \\ \text{subject to} & (I - \gamma P_a)\, v - r_a \ge 0, \quad a \in A, \end{array} \tag{2.3.2}$$
where $\xi$ is an arbitrary vector with positive entries, $P_a \in \mathbb{R}^{|S| \times |S|}$ is the matrix whose
$(i,j)$-th entry equals $P_a(i,j)$, $I$ is the $|S| \times |S|$ identity matrix, and $r_a \in \mathbb{R}^{|S|}$ is
the expected state transition reward under action $a$, given by $r_a(i) = \sum_{j \in S} P_a(i,j)\, r_{ija}$, $i \in S$. The dual linear program of (2.3.2) is
$$\begin{array}{ll} \text{maximize} & \displaystyle\sum_{a \in A} \lambda_a^T r_a \\ \text{subject to} & \displaystyle\sum_{a \in A} \left(I - \gamma P_a^T\right) \lambda_a = \xi, \quad \lambda_a \ge 0, \quad a \in A. \end{array} \tag{2.3.3}$$
We will show that the optimal solution λ∗ to the dual problem (2.3.3) corresponds
to the optimal policy π∗ of the MDP. The duality between the optimal value vector
and the optimal policy is established in Theorem 2.3.1. We remark that part of these
results was known in the classical MDP literature; see Puterman, 2014 [90], Section
6.9. We provide a short proof for completeness.
Theorem 2.3.1 (Value-Policy Duality for Discounted MDP). Assume that the
discounted-reward infinite-horizon MDP tuple $M = (S, A, P, r, \gamma)$ has a unique
optimal policy $\pi^*$. Then $(v^*, \lambda^*)$ is the unique pair of primal and dual solutions to
(2.3.2), (2.3.3) if and only if
$$v^* = (I - \gamma P_{\pi^*})^{-1} r_{\pi^*}, \qquad \left(\lambda^*_{\pi^*(i),i}\right)_{i \in S} = (I - \gamma P_{\pi^*}^T)^{-1} \xi, \qquad \lambda^*_{a,i} = 0 \ \text{if} \ a \neq \pi^*(i).$$
Proof. The proof is based on the fundamental property of linear programming, i.e.,
$(v^*, \lambda^*)$ is the optimal pair of primal and dual solutions if and only if:

(a) $v^*$ is primal feasible, i.e., $(I - \gamma P_a)\, v^* - r_a \ge 0$ for all $a \in A$.

(b) $\lambda^*$ is dual feasible, i.e., $\sum_{a \in A} \left(I - \gamma P_a^T\right) \lambda^*_a = \xi$ and $\lambda^*_a \ge 0$ for all $a \in A$.

(c) $(v^*, \lambda^*)$ satisfies the complementary slackness condition
$$\lambda^*_{a,i} \cdot \left(v^*_i - \gamma P_{a,i} v^* - r_{a,i}\right) = 0 \quad \forall\, i \in S,\ a \in A,$$
where $\lambda^*_{a,i}$ is the $i$-th element of $\lambda^*_a$ and $P_{a,i}$ is the $i$-th row of $P_a$.
Suppose that $(v^*, \lambda^*)$ is primal-dual optimal. As a result, it satisfies (a), (b), (c)
and $v^*$ is the optimal value vector. By the definition of the optimal value function, we
know that $v^*_i - \gamma P_{\pi^*(i),i} v^* - r_{\pi^*(i),i} = 0$. Since $\pi^*$ is unique, we have $v^*_i - \gamma P_{a,i} v^* - r_{a,i} > 0$ if $a \neq \pi^*(i)$. As a result, we have $\lambda^*_{a,i} = 0$ for all $a \neq \pi^*(i)$. This means that the
optimal dual variable $\lambda^*$ has exactly $|S|$ nonzeros, corresponding to the $|S|$ active row
constraints of the primal problem (2.3.2). We combine this observation with the dual
feasibility relation $\sum_{a \in A} \left(I - \gamma P_a^T\right) \lambda^*_a = \xi$ and obtain
$$(I - \gamma P_{\pi^*}^T) \left(\lambda^*_{\pi^*(i),i}\right)_{i \in S} = \xi.$$
Note that all eigenvalues of $P_{\pi^*}$ lie in the closed unit disk, therefore $(I - \gamma P_{\pi^*}^T)$ is
invertible. Then we have $\left(\lambda^*_{\pi^*(i),i}\right)_{i \in S} = (I - \gamma P_{\pi^*}^T)^{-1} \xi$, which together with the
complementary slackness condition implies that $\lambda^*$ is unique. Similarly, we can show that
$v^* = (I - \gamma P_{\pi^*})^{-1} r_{\pi^*}$ from primal feasibility and the slackness condition.
Now suppose that (v∗, λ∗) satisfies the three conditions stated in Theorem 2.3.1.
Then we obtain (a),(b),(c) directly, which proves that (v∗, λ∗) is primal-dual optimal.
□
Theorem 2.3.1 suggests a critical correspondence between the optimal dual solu-
tion λ∗ and the optimal policy π∗. In particular, one can recover the optimal policy
π∗ from the basis of λ∗ as follows:
π∗(i) = a, if λ∗a,i > 0.
In other words, finding the optimal policy is equivalent to finding the basis of the
optimal dual solution. This suggests that learning the optimal policy is a special case
of solving a stochastic linear program.
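To make the value-policy duality concrete, one can solve the dual LP (2.3.3) on a toy instance and read the optimal policy off the basis of $\lambda^*$. The following sketch uses SciPy's `linprog`; the two-state problem data are illustrative, not from the text.

```python
import numpy as np
from scipy.optimize import linprog

# A toy discounted MDP (hypothetical data): 2 states, 2 actions.
n, m, gamma = 2, 2, 0.9
P = np.zeros((m, n, n))
P[0] = np.eye(n)                        # action 0: stay put, reward 0
P[1] = np.array([[0., 1.], [1., 0.]])   # action 1: swap states, reward 1
r = np.array([[0., 0.],
              [1., 1.]])                # r[a, i] = expected reward r_a(i)
xi = np.ones(n)                         # any positive weight vector

# Dual LP (2.3.3): maximize sum_a lambda_a^T r_a subject to
# sum_a (I - gamma P_a^T) lambda_a = xi, lambda >= 0.
A_eq = np.hstack([np.eye(n) - gamma * P[a].T for a in range(m)])
c = -r.reshape(-1)                      # linprog minimizes, so negate
res = linprog(c, A_eq=A_eq, b_eq=xi, bounds=(0, None), method="highs")
lam = res.x.reshape(m, n)

# Value-policy duality: the optimal policy sits on the basis of lambda*,
# i.e. pi*(i) = argmax_a lambda_a(i) (one positive entry per state).
policy = lam.argmax(axis=0)
print(policy)   # -> [1 1]: always swap, collecting reward 1 per step
```

Summing the equality constraints also recovers $\|\lambda^*\|_1 = \|\xi\|_1/(1-\gamma)$, the normalization used by the set $\Delta$ in (2.3.6).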
Saddle Point Formulation of Discounted MDP. We rewrite the LP (2.3.2) as an equivalent minimax problem:
$$\min_{v \in \mathbb{R}^{|S|}} \max_{\lambda \ge 0} \; L(v, \lambda) = \xi^T v + \sum_{a \in A} \lambda_a^T \left((\gamma P_a - I)\, v + r_a\right). \tag{2.3.4}$$
The primal variable $v$ is of dimension $|S|$, and the dual variable $\lambda = (\lambda_a)_{a \in A} = (\lambda_{a,i})_{a \in A,\, i \in S}$
is of dimension $|S| \cdot |A|$. Each subvector $\lambda_a \in \mathbb{R}^{|S|}$ is the vector multiplier corresponding
to the constraint inequalities $(I - \gamma P_a)\, v - r_a \ge 0$. Each entry $\lambda_{a,i}$ is the scalar
multiplier associated with the $i$-th row of $(I - \gamma P_a)\, v - r_a \ge 0$.
In order to develop an efficient algorithm, we modify the saddle point problem as follows:
$$\min_{v \in \mathbb{R}^{|S|}} \max_{\lambda \in \mathbb{R}^{|S \times A|}} \left\{ L(v, \lambda) = \xi^T v + \sum_{a \in A} \lambda_a^T \left((\gamma P_a - I)\, v + r_a\right) \right\}, \quad \text{subject to } v \in V,\ \lambda \in \Xi \cap \Delta, \tag{2.3.5}$$
where
$$V = \left\{ v \ \Big|\ v \ge 0,\ \|v\|_\infty \le \frac{\sigma}{1-\gamma} \right\}, \quad \Xi = \left\{ \lambda \ \Big|\ \sum_{a \in A} \lambda_{a,i} \ge \xi_i,\ \forall\, i \in S \right\}, \quad \Delta = \left\{ \lambda \ \Big|\ \lambda \ge 0,\ \|\lambda\|_1 = \frac{\|\xi\|_1}{1-\gamma} \right\}. \tag{2.3.6}$$
We will show later that v∗ and λ∗ belong to V and Ξ∩∆ respectively (Lemma A.1.1).
As a result, the modified saddle point problem (2.3.5) is equivalent to the original
problem (2.3.4).
2.3.2 Finite-State Finite-Horizon MDP
Consider the finite-horizon Markov decision process described by a tuple $M = (S, A, H, P, r)$ as in Section 2.2.2. The Bellman equation of the finite-horizon MDP is given by
$$v^*_h(i) = \max_{a \in A}\left\{ r_a(i) + \sum_{j \in S} P_a(i,j)\, v^*_{h+1}(j) \right\}, \quad \forall\, h \in [H],\ i \in S, \qquad v^*_H = 0, \tag{2.3.7}$$
where $P_a$ is the transition probability matrix of the fixed action $a$. The vector form
of the Bellman equation is
$$v^*_h = \max_{a \in A}\left\{ r_a + P_a v^*_{h+1} \right\}, \quad \forall\, h \in [H], \qquad v^*_H = 0,$$
where the maximization is carried out component-wise. A vector v∗ satisfies the
Bellman equation if and only if it is the optimal value function.
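When the model is known, the recursion (2.3.7) is ordinary backward induction; a minimal sketch (with hypothetical inputs `P` and `r` laid out as described above) is:

```python
import numpy as np

def solve_finite_horizon(P, r, H):
    """Backward induction for the Bellman equation (2.3.7).

    P[a] is the |S| x |S| transition matrix of action a and r[a] the
    expected reward vector r_a; returns (values, policies) per period.
    """
    m, n = len(P), P[0].shape[0]
    v = np.zeros((H + 1, n))                 # terminal condition v_H = 0
    pi = np.zeros((H, n), dtype=int)
    for h in range(H - 1, -1, -1):           # sweep backward over periods
        # Q[a, i] = r_a(i) + sum_j P_a(i, j) v_{h+1}(j)
        Q = np.stack([r[a] + P[a] @ v[h + 1] for a in range(m)])
        v[h] = Q.max(axis=0)                 # component-wise maximum
        pi[h] = Q.argmax(axis=0)
    return v[:H], pi
```

On a toy instance where one action always pays reward 1, this returns $v_h = (H-h)\mathbf{1}$, matching the recursion.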
The Bellman equation is equivalent to the following linear program:
$$\begin{array}{ll} \text{minimize} & \displaystyle\sum_{h=0}^{H-1} \xi_h^T v_h \\ \text{subject to} & v_h \ge P_a v_{h+1} + r_a, \quad a \in A,\ h \in [H], \\ & v_H = 0, \end{array} \tag{2.3.8}$$
where $\xi = (\xi_0, \ldots, \xi_{H-1})$ is an arbitrary vector with positive entries and $v = (v_0^T, \ldots, v_{H-1}^T)^T$ is the primal variable. The above linear program has $|S||A|H$
constraints and $|S|H$ variables. The dual linear program of (2.3.8) is given by
$$\begin{array}{ll} \text{maximize} & \displaystyle\sum_{h=0}^{H-1} \sum_{a \in A} \lambda_{h,a}^T r_a \\ \text{subject to} & \displaystyle\sum_{a \in A} \left(\lambda_{h,a} - P_a^T \lambda_{h-1,a}\right) = \xi_h, \quad h \in [H], \\ & \lambda_{-1,a} = 0, \quad \lambda_{h,a} \ge 0, \quad h \in [H],\ a \in A, \end{array} \tag{2.3.9}$$
where the dual variable $\lambda = (\lambda_{h,a})_{h \in [H],\, a \in A}$ is a vector of dimension $|S||A|H$. In the
remainder of the chapter, we use the notation $\lambda_h$ to denote the vector $\lambda_h = (\lambda_{h,a})_{a \in A}$
and $\lambda_a$ to denote the vector $\lambda_a = (\lambda_{h,a})_{h \in [H]}$. We denote by $\lambda^*$ the optimal dual
solution. Now we establish the value-policy duality for finite-horizon MDP.
Theorem 2.3.2 (Value-Policy Duality for Finite-Horizon MDP). Assume that the
finite-horizon MDP tuple $M = (S, A, H, P, r)$ has a unique optimal policy $\pi^* = (\pi^*_h)_{h \in [H]}$. The vector pair $(v^*, \lambda^*)$ is the unique pair of primal and dual solutions to
(2.3.8), (2.3.9) if and only if
$$v^* = (I - \Pi_{\pi^*})^{-1} (R_{h,\pi^*})_{h \in [H]}, \qquad \left(\lambda^*_{h,\pi^*_h(i)}(i)\right)_{h \in [H],\, i \in S} = (I - \Pi_{\pi^*}^T)^{-1} \xi, \qquad \lambda^*_{h,a}(i) = 0 \ \text{if} \ a \neq \pi^*_h(i),$$
where $\Pi_{\pi^*}$ is the transition probability matrix under the optimal policy $\pi^*$ and $R_{h,\pi^*}$ is the expected
transition reward under $\pi^*$, i.e., $R_{h,\pi^*}(i) = r_{\pi^*_h(i)}(i)$ for all $i \in S$ and $h \in [H]$.
Proof. The proof is based on the fundamental property of linear duality, i.e.,
$(v^*, \lambda^*)$ is the optimal pair of primal and dual solutions if and only if:

(a) $v^*$ is primal feasible, i.e., $v^*_h - P_a v^*_{h+1} - r_a \ge 0$, $\forall\, h \in [H],\ a \in A$.

(b) $\lambda^*$ is dual feasible, i.e., $\sum_{a \in A} \left(\lambda^*_{h,a} - P_a^T \lambda^*_{h-1,a}\right) = \xi_h$ and $\lambda^*_{h,a} \ge 0$, $\forall\, h \in [H],\ a \in A$.

(c) $(v^*, \lambda^*)$ satisfies the complementary slackness condition
$$\lambda^*_{h,a}(i) \cdot \left(v^*_h(i) - r_a(i) - P_{a,i} v^*_{h+1}\right) = 0 \quad \forall\, h \in [H],\ i \in S,\ a \in A,$$
where $\lambda^*_{h,a}(i)$ is the $i$-th element of $\lambda^*_{h,a}$ and $P_{a,i}$ is the $i$-th row of $P_a$.
Suppose that $(v^*, \lambda^*)$ is primal-dual optimal. As a result, it satisfies (a), (b), (c)
and $v^*$ is the optimal value vector. By the definition of the optimal value function,
we know that $v^*_h(i) - P_{\pi^*_h(i),i} v^*_{h+1} - r_{\pi^*_h(i)}(i) = 0$. Since $\pi^*$ is unique, we have $v^*_h(i) - P_{a,i} v^*_{h+1} - r_a(i) > 0$ if $a \neq \pi^*_h(i)$. As a result, we have $\lambda^*_{h,a}(i) = 0$ for all $a \neq \pi^*_h(i)$.
Together with the dual feasibility relation $\sum_{a \in A} \left(\lambda^*_{h,a} - P_a^T \lambda^*_{h-1,a}\right) = \xi_h$, we have
$$(I - \Pi_{\pi^*}^T) \left(\lambda^*_{h,\pi^*_h(i)}(i)\right)_{h \in [H],\, i \in S} = \xi,$$
where $\Pi_{\pi^*}$ is defined in Section 2.2.2. Note that $(I - \Pi_{\pi^*}^T)$ is invertible since its determinant equals one. Then we have $\left(\lambda^*_{h,\pi^*_h(i)}(i)\right)_{h \in [H],\, i \in S} = (I - \Pi_{\pi^*}^T)^{-1} \xi$, which
together with the complementary slackness condition implies that $\lambda^*$ is unique. Similarly, we
can show that $v^* = (I - \Pi_{\pi^*})^{-1} (R_{h,\pi^*})_{h \in [H]}$ from primal feasibility and the slackness
condition.
Now suppose that (v∗, λ∗) satisfies the three conditions stated in Theorem 2.3.2.
Then we obtain (a),(b),(c) directly, which proves that (v∗, λ∗) is primal-dual optimal.
□
Saddle Point Formulation of Finite-Horizon MDP. By Lagrangian duality, we can rewrite the linear program (2.3.8) as the following saddle
point problem:
$$\min_{v \in V} \max_{\lambda \in \Xi \cap \Delta} \; L(v, \lambda) = \sum_{h=0}^{H-1} \xi_h^T v_h + \sum_{h=0}^{H-1} \sum_{a \in A} \lambda_{h,a}^T \left(r_a + P_a v_{h+1} - v_h\right), \tag{2.3.10}$$
where
$$V = \left\{ v \ \Big|\ v_h \ge 0,\ \|v_h\|_\infty \le (H-h)\sigma,\ \forall\, h \in [H] \right\},$$
$$\Xi = \left\{ \lambda \ \Big|\ \sum_{a \in A} \lambda_{h,a} \ge \xi_h,\ \forall\, h \in [H] \right\}, \quad \Delta = \left\{ \lambda \ \Big|\ \lambda \ge 0,\ \|\lambda_h\|_1 = \sum_{\tau=0}^{h} \|\xi_\tau\|_1,\ \forall\, h \in [H] \right\}. \tag{2.3.11}$$
We will prove later that the optimal primal solution $v^*$ and the optimal dual solution
$\lambda^*$ satisfy the additional constraints $V$ and $\Xi \cap \Delta$ respectively (Lemma A.2.1). As
a result, $(v^*, \lambda^*)$ is a pair of primal and dual solutions to the saddle point problem
(2.3.10).
We now state the matrix form of the saddle point problem. Let $I$ be the identity
matrix of dimension $|S|H$ and let $\Pi_a \in \mathbb{R}^{|S|H \times |S|H}$ be the transition
probability matrix of the augmented Markov chain under the fixed action $a$, taking the form
$$\Pi_a = \begin{pmatrix} 0 & P_a & 0 & \cdots & 0 \\ 0 & 0 & P_a & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & P_a \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}.$$
Then the Lagrangian $L(v, \lambda)$ is equivalent to
$$L(v, \lambda) = \xi^T v + \sum_{a \in A} \lambda_a^T \left(R_a + (\Pi_a - I)\, v\right), \tag{2.3.12}$$
where $R_a = (r_a^T, \ldots, r_a^T)^T$ and $\lambda_a = (\lambda_{h,a})_{h \in [H]}$. The partial derivatives of the Lagrangian
are
$$\nabla_v L(v, \lambda) = \xi + \sum_{a \in A} (\Pi_a^T - I)\, \lambda_a, \qquad \nabla_{\lambda_a} L(v, \lambda) = R_a + (\Pi_a - I)\, v.$$
In what follows, we will exploit the linear structure of these partial derivatives to develop
stochastic primal-dual algorithms.
2.4 Stochastic Primal-Dual Methods for Markov Decision Problems
We are interested in developing algorithms that not only apply to explicitly given
MDP models but also apply to reinforcement learning. In particular, we focus on the
model-free learning setting, which is summarized as follows.

Model-Free Learning Setting of MDP

(i) The state space $S$, the action space $A$, the reward upper bound $\sigma$, and the discount factor $\gamma$ (or the horizon $H$) are known.

(ii) The transition probabilities $P$ and reward function $r$ are unknown.

(iii) There is a Sampling Oracle (SO) that takes input $(i, a)$ and generates a new state $j$ with probability $P_a(i,j)$ and a random reward $\hat{r}_{ija} \in [0, \sigma]$ with expectation $r_{ija}$.
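For experiments, the sampling oracle can be mocked in a few lines. The sketch below is an assumption-laden illustration: it stores the reward means in a hypothetical array `r[a, i, j]` and uses Bernoulli-scaled noise so that realized rewards stay in $[0, \sigma]$ while their expectations equal $r_{ija}$.

```python
import numpy as np

class SamplingOracle:
    """A mock SO for the model-free setting: sample(i, a) draws the next
    state j ~ P_a(i, .) and a random reward in [0, sigma] with mean
    r[a, i, j] (a hypothetical storage layout for the means r_ija)."""

    def __init__(self, P, r, sigma, rng=None):
        self.P, self.r, self.sigma = P, r, sigma
        self.rng = rng or np.random.default_rng(0)

    def sample(self, i, a):
        # Next state drawn from the (hidden) transition row P_a(i, .).
        j = self.rng.choice(self.P.shape[1], p=self.P[a, i])
        # Bernoulli-scaled noise: value sigma w.p. r/sigma, else 0, so
        # the realized reward lies in [0, sigma] with mean r[a, i, j].
        reward = self.sigma * self.rng.binomial(1, self.r[a, i, j] / self.sigma)
        return int(j), float(reward)
```

The learner only ever calls `sample`, matching item (iii): $P$ and $r$ stay hidden inside the oracle.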
Motivated by the value-policy duality (Theorems 2.3.1, 2.3.2), we develop a class of
stochastic primal-dual methods for the saddle point formulation of the Bellman equation.
In particular, we develop the Stochastic Primal-Dual method for Discounted Markov
Decision Process (SPD-dMDP), as given in Algorithm 2.1.
The algorithm keeps an estimate of the value function v and the dual iterates λ
in the minimax problem (2.3.4) and makes updates to them while sampling random
states and actions. At the k-th iteration, the algorithm samples a random state i and
a random action a. Then the algorithm uses the sampled state and action to compute
noisy gradients of the minimax problem (2.3.4) with respect to the value function v
and the dual iterates λ. After updating the value function and the dual iterates using
the gradients, the algorithm projects v and λ back to the corresponding constraints
V and Ξ ∩ ∆ in the optimization problem (2.3.5). Essentially, each iteration is a
stochastic primal-dual iteration for solving the saddle point problem.
We also develop the Stochastic Primal-Dual method for Finite-horizon Markov
Decision Process (SPD-fMDP), as given by Algorithm 2.2. The SPD algorithms
maintain a running estimate of the optimal value function (i.e., the primal solution)
and the optimal policy (i.e., the dual solution). They make simple updates to the
value and policy estimates as new state and reward observations are drawn from the
sampling oracle.
Implementation and Computational Complexities The proposed SPD algo-
rithms are essentially stochastic coordinate descent methods. They exhibit favorable
properties such as small space complexity and small computational complexity per
iteration.
The SPD-dMDP Algorithm 2.1 updates the value and policy estimates by processing one sampled state transition at a time. It keeps track of the dual variable $\lambda$ and
the primal variable $v$, which requires $|A| \times |S| + |S| = O(|A| \times |S|)$ space. In comparison,
the dimension of the discounted MDP is $|A| \times |S|^2$. Thus the space complexity of Algorithm 2.1 is sublinear with respect to the problem size. Moreover, the SPD-dMDP
Algorithm 2.1 is a coordinate descent method. It updates two coordinates of the
value estimate and one coordinate of the policy estimate per iteration. After
the dual update, the algorithm first projects the dual iterates onto the simplex $\Delta$ and
then projects them onto the box constraint $\Xi$. The projection onto the
simplex can be implemented using $O(|S| \times |A|)$ arithmetic operations [25]. The projection onto $\Xi$ under
the simplex constraint can be implemented using $O(|S| \times |A|)$ arithmetic operations
as well.¹ The overall computation complexity per iteration is $O(|S| \times |A|)$ arith-

¹Consider the projection of $x \in \Delta$ onto $\Xi \cap \Delta$. We can write the problem as $\min_{y \in \Delta} \sum_a \sum_i (y_{a,i} - x_{a,i})^2$, s.t. $\sum_a y_{a,j} \ge c$, $\forall j$, where $c$ is a constant. The derivative of the objective with respect to $y_{a,i}$ is $2(y_{a,i} - x_{a,i})$. Denote by $y^*$ the optimal solution. For two variables $y^*_{a_1,i_1}$, $y^*_{a_2,i_2}$, the derivatives of the objective with respect to them, $2(y^*_{a_1,i_1} - x_{a_1,i_1})$ and $2(y^*_{a_2,i_2} - x_{a_2,i_2})$, must equal each other
Algorithm 2.1 Stochastic Primal-Dual Algorithm for Discounted MDP (SPD-dMDP)

Input: Sampling Oracle SO, $n = |S|$, $m = |A|$, $\gamma \in (0,1)$, $\sigma \in (0,\infty)$
Initialize $v^{(0)} : S \mapsto \left[0, \frac{\sigma}{1-\gamma}\right]$ and $\lambda^{(0)} : S \times A \mapsto \left[0, \frac{\|\xi\|_1 \sigma}{1-\gamma}\right]$ arbitrarily.
Set $\xi = \frac{\sigma}{\sqrt{n}}\, e$
for $k = 1, 2, \ldots, T$ do
    Sample $i$ uniformly from $S$
    Sample $a$ uniformly from $A$
    Sample $j$ and $\hat{r}_{ija}$ conditioned on $(i, a)$ from SO
    Set $\beta = \sqrt{n/k}$
    Update the primal iterates by
        $v^{(k)}(i) \leftarrow \max\left\{\min\left\{v^{(k-1)}(i) - \beta\left(\frac{1}{m}\xi(i) - \lambda_a^{(k-1)}(i)\right), \frac{\sigma}{1-\gamma}\right\}, 0\right\}$
        $v^{(k)}(j) \leftarrow \max\left\{\min\left\{v^{(k-1)}(j) - \gamma\beta\lambda_a^{(k-1)}(i), \frac{\sigma}{1-\gamma}\right\}, 0\right\}$
        $v^{(k)}(s) \leftarrow v^{(k-1)}(s)$, $\forall\, s \neq i, j$
    Update the dual iterates by
        $\lambda_a^{(k-\frac{1}{2})}(i) \leftarrow \lambda_a^{(k-1)}(i) + \beta\left(\gamma v^{(k-1)}(j) - v^{(k-1)}(i) + \hat{r}_{ija}\right)$
        $\lambda^{(k-\frac{1}{2})}(a', i') \leftarrow \lambda^{(k-1)}(a', i')$, $\forall\, (a', i')$ such that $a' \neq a$ or $i' \neq i$
    Project the dual iterates by
        $\lambda^{(k)} \leftarrow \Pi_\Xi \Pi_\Delta\, \lambda^{(k-\frac{1}{2})}$, where $\Xi$ and $\Delta$ are given by (2.3.6)
end for
Output: Averaged dual iterate $\hat{\lambda} = \frac{1}{T}\sum_{k=1}^{T} \lambda^{(k)}$ and randomized policy $\hat{\pi}$ where $\mathbb{P}(\hat{\pi}(i) = a) = \frac{\hat{\lambda}_a(i)}{\sum_{a \in A} \hat{\lambda}_a(i)}$
metic operations. Therefore SPD Algorithm 2.1 uses sublinear space complexity and
sublinear computation complexity per iteration.
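One iteration of Algorithm 2.1 can be sketched in a few lines of NumPy. This is a simplified illustration, not the analyzed method: the `oracle(i, a)` interface and the projection step (plain clipping followed by rescaling onto $\Delta$, rather than the exact $\Pi_\Xi \Pi_\Delta$ projection) are assumptions of the sketch.

```python
import numpy as np

def spd_dmdp_step(v, lam, k, oracle, gamma, sigma, xi, rng):
    """One iteration of SPD-dMDP (a sketch of Algorithm 2.1).

    oracle(i, a) is assumed to return (j, reward). The exact
    Pi_Xi Pi_Delta projection is simplified here to clipping at zero
    plus rescaling onto Delta = {lam >= 0, ||lam||_1 = ||xi||_1/(1-gamma)}.
    """
    n, m = v.size, lam.shape[0]
    i, a = int(rng.integers(n)), int(rng.integers(m))
    j, reward = oracle(i, a)
    beta = np.sqrt(n / k)
    vmax = sigma / (1.0 - gamma)
    # Noisy temporal difference, computed from the previous primal iterate.
    td = gamma * v[j] - v[i] + reward
    # Primal update on the two sampled coordinates, clipped to [0, vmax].
    v[i] = np.clip(v[i] - beta * (xi[i] / m - lam[a, i]), 0.0, vmax)
    v[j] = np.clip(v[j] - gamma * beta * lam[a, i], 0.0, vmax)
    # Dual update on the sampled (a, i) coordinate.
    lam[a, i] += beta * td
    # Simplified projection of the dual iterate.
    lam = np.maximum(lam, 0.0)
    lam *= xi.sum() / ((1.0 - gamma) * max(lam.sum(), 1e-12))
    return v, lam
```

Averaging the dual iterates over $k$ and normalizing per state then yields the randomized policy, per the output step of the algorithm.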
if $\sum_a y^*_{a,i_1} > c$ and $\sum_a y^*_{a,i_2} > c$. Otherwise, we could move mass between $y^*_{a_1,i_1}$ and $y^*_{a_2,i_2}$ to obtain a smaller objective value. Thus we can compute $y^*$ explicitly as follows. For each $i$ such that $\sum_a x_{a,i} < c$, we set $y^*_{\cdot,i}$ so that $\sum_a y^*_{a,i} = c$. For the remaining variables $y^*_{a,i}$, we set the value of $y^*_{a,i}$ to $x_{a,i}$ plus a constant shift so that $\sum_a \sum_i y^*_{a,i}$ is 1. Note that this projection algorithm also works for other distance metrics such as the KL divergence $\sum_{a,i} y_{a,i} \log(y_{a,i}/x_{a,i})$.

Algorithm 2.2 has a similar spirit to Algorithm 2.1. It keeps track of the
dual variable $(\lambda_{h,a})_{h \in [H],\, a \in A}$ (randomized policies for all periods) and the primal
variable $v_0, \ldots, v_{H-1}$ (value functions for all periods). Algorithm 2.2 is specially designed
for $H$-period MDP in two aspects. It uses a non-uniform weight vector with $\xi_0 = \frac{e}{H}$
and $\xi_h = \frac{e}{(H-h)(H-h+1)}$, which places more weight on later periods to balance their
smaller values. It uses larger stepsizes to update policies associated with later periods more aggressively, while using smaller stepsizes to update earlier-period policies
more conservatively. The space complexity is $O(|S| \times |A| \times H)$, which is sublinear
with respect to the problem dimension $O(|S|^2 \times |A| \times H)$. Algorithm 2.2 is essentially
a coordinate descent method that involves projections onto simple sets. The computation complexity per iteration is $O(|S| \times |A| \times H)$, which is mainly due to the
projection step.
Comparisons with Existing Methods Our SPD algorithms are fundamentally
different from the existing reinforcement learning methods. From a theoretical per-
spective, our SPD algorithms are based on the stochastic saddle point formulation of
the Bellman equation. To the authors’ best knowledge, this idea has not been used
in any existing method. From a practical perspective, the SPD methods are easy to
implement and have small space and computational complexity (one of the smallest
compared to existing methods). In what follows, we compare the newly proposed
SPD methods with several popular existing methods.
• The new SPD methods share a similar spirit with the Q-learning and delayed
Q-learning methods. Both of them maintain and update a value for each state-
action pair (i, a). Delayed Q-learning maintains estimates of the value function
at each state-action pair (i, a), i.e., the Q values. Our SPD maintains estimates
of probabilities for choosing each (i, a), i.e., the randomized policy. In both
cases, the values associated with state-action pairs are used to determine how
to choose the actions. Delayed Q-learning uses O(|S||A|) space and O(ln |A|)
Algorithm 2.2 Stochastic Primal-Dual Algorithm for Finite-horizon MDP (SPD-fMDP)

Input: Sampling Oracle SO, $n = |S|$, $m = |A|$, $H$, $\sigma \in (0,\infty)$
Initialize $v_h : S \mapsto [0, (H-h)\sigma]$ and $\lambda_h : S \times A \mapsto \left[0, \frac{n}{H-h}\right]$, $\forall\, h \in [H]$ arbitrarily
Set $\xi_0 = \frac{e}{H}$ and $\xi_h = \frac{e}{(H-h)(H-h+1)}$, $\forall\, h \neq 0$
for $k = 1, 2, \ldots, T$ do
    Sample $i$ uniformly from $S$
    Sample $a$ uniformly from $A$
    Sample $j$ and $\hat{r}_{ija}$ conditioned on $(i, a)$ from SO
    Update the primal iterates by
        $v_h^{(k)}(i) \leftarrow \max\left\{\min\left\{v_h^{(k-1)}(i) - \frac{(H-h)^2\sigma}{\sqrt{k}}\left(\xi_h(i) - m\lambda_{h,a}^{(k-1)}(i)\right), (H-h)\sigma\right\}, 0\right\}$, $\forall\, h \in [H]$
        $v_h^{(k)}(j) \leftarrow \max\left\{\min\left\{v_h^{(k-1)}(j) - \frac{m(H-h)^2\sigma}{\sqrt{k}}\lambda_{h-1,a}^{(k-1)}(i), (H-h)\sigma\right\}, 0\right\}$, $\forall\, h \in [H]$
        $v_h^{(k)}(s) \leftarrow v_h^{(k-1)}(s)$, $\forall\, h \in [H]$, $s \neq i, j$
    Update the dual iterates by
        $\lambda_{h,a}^{(k-\frac{1}{2})}(i) \leftarrow \lambda_{h,a}^{(k-1)}(i) + \frac{n}{(H-h)^2\sigma\sqrt{k}}\left(v_{h+1}^{(k-1)}(j) - v_h^{(k-1)}(i) + \hat{r}_{ija}\right)$, $\forall\, h \in [H]$
        $\lambda_{h,a'}^{(k-\frac{1}{2})}(i') \leftarrow \lambda_{h,a'}^{(k-1)}(i')$, $\forall\, h \in [H]$, $\forall\, (a', i')$ such that $a' \neq a$ or $i' \neq i$
    Project the dual iterates by
        $\lambda^{(k)} \leftarrow \Pi_\Xi \Pi_\Delta\, \lambda^{(k-\frac{1}{2})}$, where $\Xi$ and $\Delta$ are given by (2.3.11)
end for
Output: Averaged dual iterate $\hat{\lambda} = \frac{1}{T}\sum_{k=1}^{T} \lambda^{(k)}$ and randomized policy $\hat{\pi} = (\hat{\pi}_0, \ldots, \hat{\pi}_{H-1})$ where $\mathbb{P}(\hat{\pi}_h(i) = a) = \frac{\hat{\lambda}_{h,a}(i)}{\sum_{a \in A} \hat{\lambda}_{h,a}(i)}$
arithmetic operations per iteration [99]. Our SPD methods enjoy similar com-
putational advantages as the delayed Q-learning method.
• The SPD methods are also related to the class of actor-critic methods. Our
dual variable mimics the actor, while the primal variable mimics the critic.
In particular, the dual update step in SPD turns out to be very similar to
the actor update: both updates use a noisy temporal difference. Actor-critic
methods are two-timescale methods in which the actor updates on a faster scale
in comparison to the critic. In contrast, the new SPD methods have only one
timescale: the primal and dual variables are updated using a single sequence of
stepsizes. As a result, SPD methods are more efficient in utilizing new data as
they emerge and achieve an $O(1/\sqrt{T})$ rate of convergence.
• Upper confidence methods maintain and update a value or interval for each
state-action-state triplet; see the works by Lattimore and Hutter [66], Dann
and Brunskill [34]. These methods use space up to O(|S|2|A|), which is linear
with respect to the size of the MDP model. In contrast, SPD does not estimate
transition probabilities of the unknown MDP and uses only O(|S||A|) space.
To sum up, the main advantage of the SPD methods is the small storage and small
computational complexity per iteration. We note that the main computational com-
plexity of SPD is due to the projection of dual variables.
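Since the projection dominates the per-iteration cost, it is worth seeing concretely. The $\Pi_\Delta$ step is a Euclidean projection onto a scaled simplex (with total mass $z = \|\xi\|_1/(1-\gamma)$ in the discounted case). Below is a standard sort-based sketch, $O(d \log d)$ rather than the linear-time routine cited as [25]; the exact $\Pi_\Xi$ adjustment described in the footnote is a separate step not shown here.

```python
import numpy as np

def project_simplex(x, z=1.0):
    """Euclidean projection of x onto {y >= 0, sum(y) = z}, the Pi_Delta
    step of the SPD methods. Sort-based variant: O(d log d) instead of
    the O(d) median-finding routine of [25]."""
    u = np.sort(x)[::-1]                      # sort descending
    css = np.cumsum(u) - z
    # Largest index rho with u[rho] > (cumsum(u)[rho] - z) / (rho + 1).
    rho = np.nonzero(u * np.arange(1, x.size + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)            # uniform shift
    return np.maximum(x - theta, 0.0)
```

Points already on the simplex are left unchanged (the shift $\theta$ comes out zero), and any other point is shifted uniformly and clipped at zero.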
2.5 Main Results
In this section, we study the convergence of the two SPD methods: the SPD-dMDP
Algorithm 2.1 and the SPD-fMDP Algorithm 2.2. Our main results show that the
SPD methods output a randomized policy that is absolute-$\epsilon$-optimal using finite samples with high probability. We analyze the cases of discounted MDP and finite-horizon
MDP separately. All the proofs of the theorems are deferred to the appendix.
2.5.1 Sample Complexity Analysis of Discounted-Reward
MDPs
We analyze the SPD-dMDP Algorithm 2.1 as a stochastic analog of a deterministic
primal-dual iteration. We show that the duality gap associated with the primal and
dual iterates decreases to zero with the following guarantee.
Theorem 2.5.1 (PAC Duality Gap). For any $\epsilon > 0$, $\delta \in (0,1)$, let $\hat{\lambda} = (\hat{\lambda}_a)_{a \in A} \in \mathbb{R}^{|S \times A|}$ be the averaged dual iterate generated by the SPD-dMDP Algorithm 2.1 using
the following sample size/iteration number:
$$\Omega\left(\frac{|S|^3|A|^2\sigma^4}{(1-\gamma)^4\epsilon^2}\ln\left(\frac{1}{\delta}\right)\right).$$
Then the dual iterate $\hat{\lambda}$ satisfies
$$\sum_{a \in A} (\hat{\lambda}_a)^T \left(v^* - \gamma P_a v^* - r_a\right) \le \epsilon$$
with probability at least $1 - \delta$.
Proof Outline. The SPD-dMDP Algorithm 2.1 can be viewed as a stochastic approximation scheme for the saddle point problem (2.3.5). Upon drawing a triplet $(i_k, a_k, j_k)$,
we obtain noisy samples of the partial derivatives $\nabla_v L(v^{(k)}, \lambda^{(k)})$ and $\nabla_\lambda L(v^{(k)}, \lambda^{(k)})$. The
SPD-dMDP Algorithm 2.1 is equivalent to
$$v^{k+1} = \Pi_V\left[v^k - \beta_k\left(\nabla_v L(v^{(k)}, \lambda^{(k)}) + \epsilon_k\right)\right],$$
$$\lambda^{k+1} = \Pi_\Xi \Pi_\Delta\left[\lambda^k + \beta_k\left(\nabla_\lambda L(v^{(k)}, \lambda^{(k)}) + \varepsilon_k\right)\right],$$
where $V$, $\Xi$ and $\Delta$ are specially constructed sets, $\epsilon_k, \varepsilon_k$ are zero-mean noise
vectors, and $\beta_k$ is the stepsize. By analyzing the distance between $(v^k, \lambda^k)$ and $(v^*, \lambda^*)$, we
obtain that the squared distance decreases by factors of the duality gap per iteration.
Then we construct a martingale based on the sequence of duality gaps and apply
Bernstein's inequality. The formal proof is deferred to the appendix.
Note that the dual variable is always nonnegative, $\hat{\lambda} \ge 0$, by the projection onto the
simplex $\Delta$. Also note that the nonnegative vector $v^* - (r_a + \gamma P_a v^*) \ge 0$ collects
the primal constraint slacks attained by the primal optimal solution $v^*$. Theorem 2.5.1 essentially gives an error bound for entries of $\hat{\lambda}$ corresponding to inactive primal
row constraints.
Now we are ready to present the sample complexity of SPD for discounted MDP.
Theorem 2.5.2 shows that the averaged dual iterate λ̂ gives a randomized policy that
approximates the optimal policy π∗. The performance of the randomized policy can
be analyzed using the diminishing duality gap from Theorem 2.5.1.
Theorem 2.5.2 (PAC Sample Complexity). For any $\epsilon > 0$, $\delta \in (0,1)$, let the SPD-dMDP Algorithm 2.1 iterate with the following sample size/iteration number:
$$\Omega\left(\frac{|S|^4|A|^2\sigma^2}{(1-\gamma)^6\epsilon^2}\ln\left(\frac{1}{\delta}\right)\right).$$
Then the output policy $\hat{\pi}$ is absolute-$\epsilon$-optimal with probability at least $1 - \delta$.
Next, we consider how to recover the optimal policy π∗ from the dual iterates λ̂
generated by the SPD-dMDP Algorithm 2.1. Note that the policy space is discrete,
which makes it possible to distinguish the optimal one from others when the estimated
policy π̂ is close enough to the optimal one.
Definition 2.5.1. Let the minimal action discrimination constant $\bar{d}$ be the minimal
efficiency loss of deviating from the optimal policy $\pi^*$ by making a single wrong action:
$$\bar{d} = \min_{(i,a):\, \pi^*(i) \neq a} \left(v^*(i) - \gamma P_{a,i} v^* - r_a(i)\right).$$
As long as there exists a unique optimal policy π∗, we have d̄ > 0. A large value
of d̄ means that it is easy to discriminate optimal actions from suboptimal actions.
A small value of d̄ means that some suboptimal actions perform similarly to optimal
actions.
Theorem 2.5.3 (Exact Recovery of the Optimal Policy). For any $\delta \in (0,1)$,
let the SPD-dMDP Algorithm 2.1 iterate with the following sample size:
$$\Omega\left(\frac{|S|^4|A|^4\sigma^2}{\bar{d}^2(1-\gamma)^4}\ln\left(\frac{1}{\delta}\right)\right).$$
Let $\hat{\pi}^{Tr}$ be obtained by rounding the randomized policy $\hat{\pi}$ to the nearest deterministic
policy, given by
$$\hat{\pi}^{Tr}(i) = \operatorname{argmax}_{a \in A} \hat{\lambda}_{a,i}, \quad i \in S.$$
Then $\mathbb{P}\left(\hat{\pi}^{Tr} = \pi^*\right) \ge 1 - \delta$.
To the best of our knowledge, this is the first result that shows how to recover the exact
optimal policy in reinforcement learning. The discrete nature of the policy space
makes exact recovery possible.
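The output and rounding steps of Theorems 2.5.2 and 2.5.3 reduce to per-state normalization and an argmax over the dual mass. A small sketch, with an illustrative actions-by-states layout for $\hat{\lambda}$:

```python
import numpy as np

def randomized_policy(lam_hat):
    """P(pi_hat(i) = a) = lam_hat[a, i] / sum_a lam_hat[a, i],
    the randomized policy output by the SPD algorithms."""
    return lam_hat / lam_hat.sum(axis=0, keepdims=True)

def round_policy(lam_hat):
    """Theorem 2.5.3 rounding: the action carrying the largest dual
    mass in each state gives the nearest deterministic policy."""
    return lam_hat.argmax(axis=0)
```

Once the duality gap is below $\bar{d}$, the argmax per state coincides with $\pi^*$, which is the content of the exact-recovery theorem.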
2.5.2 Sample Complexity Analysis of Finite-Horizon MDPs
Now we analyze the SPD-fMDP Algorithm 2.2. Again we start with the duality gap
analysis. We have the following theorem.
Theorem 2.5.4 (PAC Duality Gap). For any $\epsilon > 0$, $\delta \in (0,1)$, let $\hat{\lambda} = (\hat{\lambda}_{h,a})_{h \in [H],\, a \in A}$ be the averaged dual iterate generated by the SPD-fMDP Algorithm 2.2 using
the following sample size/iteration number:
$$\Omega\left(\frac{|S|^4|A|^2H^2\sigma^2}{\epsilon^2}\ln\left(\frac{1}{\delta}\right)\right).$$
Then the dual iterate $\hat{\lambda}$ satisfies
$$\sum_{h=0}^{H-1}\sum_{a \in A} (\hat{\lambda}_{h,a})^T \left(v^*_h - P_a v^*_{h+1} - r_a\right) \le \epsilon$$
with probability at least $1 - \delta$.
Proof Outline. We can view the SPD-fMDP Algorithm 2.2 as a stochastic approximation scheme for the saddle point problem (2.3.10). The SPD-fMDP Algorithm 2.2
is equivalent to the following iteration:
$$v_h^{k+1} = \Pi_V\left[v_h^k - \frac{(H-h)^2\sigma\gamma_k}{n}\left(\nabla_{v_h} L(v^{(k)}, \lambda^{(k)}) + \epsilon_k\right)\right],$$
$$\lambda_h^{k+1} = \Pi_\Xi \Pi_\Delta\left[\lambda_h^k + \frac{\gamma_k}{m(H-h)^2\sigma}\left(\nabla_{\lambda_h} L(v^{(k)}, \lambda^{(k)}) + \varepsilon_k\right)\right],$$
where $\Xi$ and $\Delta$ are given by (2.3.11), $\epsilon_k$ and $\varepsilon_k$ are zero-mean noise vectors, and $\gamma_k$ is the
stepsize. The formal proof is deferred to Section 6.
Next we present the sample complexity for the SPD-fMDP Algorithm 2.2. The
analysis is obtained from the duality gap.

Theorem 2.5.5 (PAC Sample Complexity). For any $\epsilon > 0$, $\delta \in (0,1)$, let the SPD-fMDP Algorithm 2.2 iterate with the following sample size:
$$\Omega\left(\frac{|S|^4|A|^2H^6\sigma^2}{\epsilon^2}\ln\left(\frac{1}{\delta}\right)\right).$$
Then the output policy $\hat{\pi}$ is absolute-$\epsilon$-optimal with probability at least $1 - \delta$.
Next we consider how to recover the optimal policy $\pi^*$ from the dual iterates $\hat{\lambda}$
generated by the SPD-fMDP Algorithm 2.2. We abuse notation and use $\bar{d}$ to denote the
minimal action discrimination constant for finite-horizon MDP.

Definition 2.5.2. Let the minimal action discrimination constant $\bar{d}$ be the minimal
efficiency loss of deviating from the optimal policy $\pi^*$ by making a single wrong action:
$$\bar{d} = \min_{(h,i,a):\, \pi^*_h(i) \neq a} \left(v^*_h(i) - P_{a,i} v^*_{h+1} - r_a(i)\right).$$
As long as there exists a unique optimal policy π∗, we have d̄ > 0. Now we state
our last theorem.
Theorem 2.5.6 (Exact Recovery of the Optimal Policy). For any $\delta \in (0,1)$,
let $\hat{\pi}^{Tr}$ be the truncated pure policy such that
$$\hat{\pi}^{Tr}_h(i) = \operatorname{argmax}_{a \in A} \hat{\lambda}_{h,a}(i), \quad i \in S_h.$$
Let the SPD-fMDP Algorithm 2.2 iterate with the following iteration number/sample
size:
$$\Omega\left(\frac{|S|^4|A|^4H^6\sigma^2}{\bar{d}^2}\ln\left(\frac{1}{\delta}\right)\right).$$
Then $\mathbb{P}\left(\hat{\pi}^{Tr} = \pi^*\right) \ge 1 - \delta$.

The results of Theorems 2.5.4, 2.5.5 and 2.5.6 for finite-horizon MDP are analogous
to Theorems 2.5.1, 2.5.2 and 2.5.3 for discounted-reward MDP. The horizon $H$ plays
a role similar to the discounted infinite sum $\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$.
2.6 Related Works
Our proposed methods use ideas from the linear program approach for Bellman’s
equations and the stochastic approximation method. The linear program formulation
of Bellman’s equation was known at around the same time when Bellman’s equation
was known; see [10, 90]. Ye [116] shows that the policy iteration of discounted MDP
is a form of the dual simplex method, which is strongly polynomial for the equivalent
linear program and terminates in run time O( |A×S|2
1−γ ). Cogill [28] considered the
exact primal-dual method for MDP with full knowledge and interpreted it as a form
of value iteration. Approximate linear programming approaches have been developed
for solving large-scale MDPs on a low-dimensional subspace, starting with de Farias
and Van Roy [36] and followed by Veatch [106] and Abbasi-Yadkori et al. [2].
Our algorithm and analysis are closely related to the class of stochastic approxi-
mation (SA) methods. For textbook references on stochastic approximation, please
see [63, 9, 16, 12]. We also use the averaging idea by [89]. In particular, our algorithm
can be viewed as a stochastic approximation method for stochastic saddle point prob-
lems, which was first studied by Nemirovski and Rubinstein [84] without the rate of
convergence analysis. Recent works Dang and Lan [33] and Chen et al. [24] studied
first-order stochastic methods for a class of general stochastic convex-concave saddle
point problems and obtained optimal and near-optimal convergence rates.
In the literature of reinforcement learning, there have been works on dual temporal
difference learning which use primal-dual-type methods; see for example [112, 72, 75,
76]. These works focused on evaluating the value function for a fixed policy. This is
different from our work, where the aim is to learn the optimal policy. We also remark
that a primal-dual learning method has been considered for the optimal stopping
problem by Borkar et al. [17]. However, no explicit sample complexity analysis is
available.
In recent years, a growing body of work has provided various rein-
forcement learning methods that are able to “learn” the optimal policy with sample
complexity guarantees. The notion of “Probably Approximately Correct” (PAC) was
considered for MDP by Strehl et al. [98], which requires that the learning method
outputs an $\epsilon$-optimal policy, with high probability, using a sample size polynomial in the parameters of the
MDP and $1/\epsilon$. Since then, many methods have been developed
for discounted MDP and proved to achieve various PAC guarantees. Strehl et al. [98]
showed that R-MAX has sample complexity O(S²A/(ε³(1 − γ)⁶)) and Delayed
Q-learning has O(SA/(ε⁴(1 − γ)⁸)). Lattimore and Hutter [66] proposed the Upper
Confidence Reinforcement Learning algorithm and obtained matching PAC upper and lower
bounds O(SA/(ε²(1 − γ)³) log(1/δ)) under a restrictive assumption: one can only move to two states
from any given state. Lattimore et al. [67] extended the analysis to more general
RL models. Azar et al. [6] showed that model-based value iteration achieves the
optimal rate O(|S × A| log(|S × A|/δ)/((1 − γ)³ε²)) for discounted MDP. Dann and Brunskill [34]
developed an upper confidence method for fixed-horizon MDP and obtained a near-optimal
rate O((S²AH²/ε²) ln(1/δ)). They also provided a lower bound Ω((SAH²/ε²) ln(1/(δ + c))). Based on their
PAC bounds, Osband and Van Roy [86] conjectured that the regret lower bound for
reinforcement learning is Ω(√(SAT)). Although the above confidence methods achieve
the close-to-optimal PAC complexity in some cases, they require maintaining a
confidence interval for each state-action-state triplet. Thus these methods are not yet
satisfactory in terms of space complexity. It remains unclear whether there is an approach
that achieves both space efficiency and a near-optimal sample complexity guarantee,
without estimating the transition probabilities. This motivates the research in
this chapter.
We emphasize that the SPD methods proposed in this chapter differ fundamentally
from the existing methods mentioned above. They are more closely related to first-order
stochastic methods for convex optimization and convex-concave saddle point problems.
The closest prior work is that of Wang and Chen [109], which proposed the first
primal-dual-type learning method but established only a loose error bound. No PAC
analysis was given in [109]. In this chapter,
we develop a new class of stochastic primal-dual methods, which are substantially
improved in both practical efficiency and theoretical complexity. Practically, the new
algorithms are essentially coordinate descent algorithms involving projections onto
simple sets. As a result, each iteration is straightforward to implement, making the
algorithms practically favorable. Theoretically, these methods come with rigorous
sample complexity guarantees. The results of this chapter provide the first PAC
guarantee for primal-dual reinforcement learning.
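To make the coordinate-descent flavor of these updates concrete, the following Python sketch shows one generic stochastic primal-dual step on a tabular MDP, with a Euclidean projection onto the probability simplex. The step sizes `alpha`, `beta` and the exact update directions are illustrative assumptions of ours, not the precise SPD-dMDP updates of Algorithms 2.1 and 2.2.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of a vector x onto the probability simplex."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1) / (np.arange(len(x)) + 1) > 0)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(x - theta, 0.0)

def spd_step(v, mu, s, a, s_next, r, gamma, alpha, beta):
    """One illustrative stochastic primal-dual update: v is the primal
    value estimate, mu a dual state-action weight on the simplex, and
    (s, a, s_next, r) a sampled transition."""
    # Sampled Bellman residual at (s, a).
    delta = r + gamma * v[s_next] - v[s]
    # Dual ascent on mu, then projection back onto the simplex.
    mu = mu.copy()
    mu[s, a] += beta * delta
    mu = project_simplex(mu.ravel()).reshape(mu.shape)
    # Primal descent on v along the sampled gradient of the Lagrangian.
    v = v.copy()
    v[s] += alpha * mu[s, a]
    v[s_next] -= alpha * gamma * mu[s, a]
    return v, mu
```

Each iteration touches only the sampled coordinates of `v` and `mu` plus one simplex projection, which is the sense in which such methods are "coordinate descent algorithms involving projections onto simple sets."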
2.7 Summary
We have presented a novel primal-dual approach for solving the MDP in the model-free
learning setting. A significant practical advantage of primal-dual learning methods
is their small storage requirement and computational efficiency. The SPD methods use
O(|S| × |A|) space and O(|S| × |A|) arithmetic operations per iteration, which is sublinear
with respect to the dimensions of the MDP. We show that the SPD methods output an
absolute-ε-optimal solution using O(|S|⁴|A|²/ε²) samples. In comparison, it is known that the sample
complexity of reinforcement learning is bounded from below by Ω(|S||A|/ε²) in a slightly
different setting [98, 34]. Clearly, our sample complexity results do not yet match the
lower bounds. However, these results represent our first steps in the study of rein-
forcement learning using the primal-dual approach. We will improve our algorithms
in Chapter 3 using a better update scheme on the dual variables, and show that the
new algorithms match the theoretical lower bounds on sample complexity.
We make several remarks about potential improvement and extensions of the
primal-dual learning methods.
• The SPD-dMDP Algorithms 2.1 and 2.2 require that the state-action pair is
sampled uniformly from S × A. In other words, Algorithms 2.1 and 2.2 use
pure exploration without any exploitation. Such a sampling oracle is suitable
for offline learning when a fixed-size static data set is given. In the online
learning setting, the sample complexity can be improved when actions are
sampled according to the latest value and policy estimates rather than uniformly.
• Another extension is to consider the average-reward MDP. In the average-reward
case, the discount factor γ and the horizon H disappear from the sample
complexity. One cannot derive the sample complexity for average-reward
MDP from the current result. We will study this problem in Chapter 3
and give algorithms that achieve theoretically optimal bounds.
We believe that the primal-dual approach studied in this chapter has significant
theoretical and practical potential. The bilinear stochastic saddle point formulation
of Bellman equations is amenable to online learning and dimension reduction. The
intrinsic linear duality between the optimal policy and value functions implies a con-
venient structure for efficient learning.
Chapter 3
Primal-Dual π Learning Using
State and Action Features
3.1 Introduction
Reinforcement learning lies at the intersection between control, machine learning,
and stochastic processes [14, 100]. The objective is to learn an optimal policy of a
controlled system from interaction data. The most studied model for a controlled
stochastic system is the Markov decision process (MDP), i.e., a controlled random
walk over a (possibly continuous) state space S, where in each state s ∈ S one can
choose an action a from an action space A so that the random walk transitions to
another state s′ ∈ S with density p(s, a, s′). In this chapter, we do not assume the
MDP model is explicitly known but consider the setting where a generative model is
given (see, e.g., [7]). In other words, there is an oracle that takes (s, a) as input and
outputs a random s′ with density p(s, a, s′) and an immediate reward r(s, a) ∈ [0, 1].
This is also known as a simulator-defined MDP in the literature [40, 102]. Our
goal is to find an optimal policy that, when running on the MDP to generate an
infinitely long trajectory, yields the highest average per-step reward in the limit or
the highest accumulative discounted reward.
Here, we focus on problems where the state and action spaces S and A are too large
to be enumerated. In practice, it might be computationally challenging even to store a
single state of the process (e.g., states could be high-resolution images). Suppose that
we are given a collection of state-action feature functions φ : S × A → Rᵐ and value
feature functions ψ : S → Rⁿ. They map each state-action pair (s, a) and state s ∈ S
to column vectors φ(s, a) = (φ₁(s, a), . . . , φₘ(s, a))ᵀ and ψ(s) = (ψ₁(s), . . . , ψₙ(s))ᵀ,
respectively, where m and n are much smaller than the sizes of S and A.
Our primary interest is to develop a sample-efficient and computationally scalable
algorithm, which takes advantage of the given features to solve an MDP with very
large or continuous state space and huge action space. Given the feature maps, φ
and ψ, we adopt linear models for approximating both the value function and the
stationary state-action density function of the MDP. By doing so, we can represent
the value functions and state-action density functions, which are high-dimensional
quantities, using a much smaller number of parameters.
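To illustrate this compact parameterization, the sketch below uses hypothetical feature maps on a toy state space S = [0, 1] with two actions. The specific choices of `phi`, `psi`, m = 4, and n = 3 are our own illustrative assumptions, not prescribed by the chapter.

```python
import numpy as np

def phi(s, a):
    """Hypothetical state-action features phi(s, a) in R^m, m = 4."""
    return np.array([1.0, s, float(a), s * float(a)])

def psi(s):
    """Hypothetical value features psi(s) in R^n, n = 3."""
    return np.array([1.0, s, s * s])

# Linear models: a value function v(s) = psi(s)^T w needs only n
# parameters, and a dual weight over phi needs only m, however large
# (or continuous) the underlying state space is.
w = np.zeros(3)      # n parameters instead of one value per state
theta = np.zeros(4)  # m parameters instead of one per state-action pair

def v(s):
    """Linearly parameterized value function."""
    return psi(s) @ w
```

The point is purely about storage: the parameter vectors `w` and `theta` replace tables indexed by S and S × A.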
Contributions. Our main contribution is a tractable, model-free primal-dual π-
learning algorithm for such compact parametric representations. The algorithm ap-
plies to a general state space, which may be continuous and infinite. It incrementally
updates the value and policy estimates from sampled state-action transitions.
• The new algorithm is inspired by a saddle point formulation of policy optimization
in MDPs, which we refer to as the Bellman saddle point problem. We show
a strong relationship between the parametric saddle point problem and the orig-
inal Bellman equation. In particular, the difference between solutions to these
two problems can be quantified using the L∞- and L1-errors of the parametric
function classes that are used to approximate the optimal value function and
state-action density function, respectively. In the special case where the approx-
imation error is zero (which we refer to as the “realizable” scenario), solving the
parametric Bellman saddle point problem is equivalent to solving the original
Bellman equation.
• Each iteration of the algorithm can be viewed as a stochastic primal-dual iter-
ation for solving the Bellman saddle point problem, where the value and policy
updates are coupled in light of strong duality. We study the sample complexity
of the π learning algorithm by analyzing the coupled primal-dual convergence
process. We show that finding an ε-optimal policy (compared to the best
approximate policy) requires a sample size that is linear in (m + n log n)/ε², ignoring
constants. The sample complexity depends only on the numbers of state and
action features. It is invariant to the actual sizes of the state and action spaces.
Notations. The following notations are used throughout the paper. For any integer
n, we use [n] to denote the set of integers {1, 2, . . . , n}. Let (Ω, F, ζ) be an
arbitrary measure space and let f : Ω → R be a measurable function defined on (Ω, F, ζ).
Denote by ∫_Ω f dζ the Lebesgue integral of f. If the meaning of ζ is clear from the
context, we write ∫_Ω f(x) dx for the Lebesgue integral of f. If f is absolutely
integrable, we define the L1 norm of f as ‖f‖₁ = ∫_Ω |f| dζ. If f is square
integrable, we define the L2 norm of f as ‖f‖₂ = (∫_Ω f² dζ)^{1/2}. The L∞
norm of f is defined as the infimum of all quantities M ≥ 0 that satisfy
|f(x)| ≤ M for almost every x. Given two measurable functions f and g, their inner
product is defined as 〈f, g〉 = ∫_Ω f g dζ. For two probability distributions u and
w over a finite set X, we denote by DKL(u‖w) their Kullback-Leibler divergence,
i.e., DKL(u‖w) = ∑_{x∈X} u(x) log(u(x)/w(x)). For two functions f(x) and g(x), we say that
f(x) = O(g(x)) if there exists a constant C such that |f(x)| ≤ Cg(x) for all x.
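As a quick check of the Kullback-Leibler definition above, a minimal implementation for distributions over a finite set (with the standard convention that terms with u(x) = 0 contribute zero):

```python
import numpy as np

def kl_divergence(u, w):
    """D_KL(u || w) = sum_x u(x) log(u(x) / w(x)) for finite
    distributions u and w; entries with u(x) = 0 contribute 0."""
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    mask = u > 0
    return float(np.sum(u[mask] * np.log(u[mask] / w[mask])))
```

For example, `kl_divergence([1, 0], [0.5, 0.5])` equals log 2, while the divergence of any distribution from itself is zero.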
3.2 Preliminaries and Formulation of the Problem
Let (S, F) be the state space that is equipped with an appropriate measure, and
we use ∫_S f(s) ds to denote the integral over this measurable space. We let A be
a finite discrete action space, and we use ∫_A f(a) da to denote the integral over the
action space. A function p(s, a, s′), s ∈ S, a ∈ A, s′ ∈ S, is called a Markov transition
function if for each state-action pair (s, a) ∈ S × A, the function p(s, a, s′), as a function
of s′ ∈ S, is a probability density function over S with ∫_S p(s, a, s′) ds′ = 1, and for
each fixed s′ ∈ S, the function p(s, a, s′) is measurable as a function of (s, a) ∈ S × A.
Let v : S → R be a function defined on S. A Markov transition function p(s, a, s′)
defines the transition operator P that maps v to a function Pv : S × A → R:

(Pv)(s, a) = ∫_S v(s′) p(s, a, s′) ds′.    (3.2.1)
Let s′ be a state sampled from p(s, a, ·) given a state-action pair (s, a). The transition
operator has the equivalent definition
(Pv)(s, a) = Es′|s,a[v(s′)].
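On a finite state space, the operator P in (3.2.1) reduces to a tensor contraction. The sketch below (with a randomly generated toy transition function of our own construction) illustrates this.

```python
import numpy as np

# Toy finite MDP: |S| = 3 states, |A| = 2 actions; P[s, a] is the
# probability vector p(s, a, ·), generated randomly for illustration.
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)      # each p(s, a, ·) sums to 1

def apply_transition_operator(P, v):
    """(Pv)(s, a) = sum_{s'} v(s') p(s, a, s') = E[v(s') | s, a]."""
    return P @ v                        # result has shape (|S|, |A|)

v = np.array([1.0, 2.0, 3.0])
Pv = apply_transition_operator(P, v)    # each entry is a convex average
```

Since each p(s, a, ·) is a probability vector, every entry of `Pv` lies between the minimum and maximum of `v`, matching the interpretation of (Pv)(s, a) as a conditional expectation.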
We denote by M = (S, A, p, p₀, r) a Markov decision process where S, A, p are
defined above, p₀ : S → R is a bounded initial state density function, and r : S × A →
R is a reward function with r(s, a) ∈ [0, 1]. The agent starts from the initial state
drawn from p0 and takes actions sequentially. After taking action a at state s, the
agent transitions to the next state with density p(s, a, ·). During the transition, the
agent receives a reward r(s, a).
In this work, we focus on the case where the agent applies a randomized stationary
policy. A randomized stationary policy is a function π(s, a), s ∈ S, a ∈ A, where
π(s, ·) is a distribution over the action space. Denote by pπ the probability density
function of p under a fixed policy π, where

pπ(s, s′) = ∫_A π(s, a) p(s, a, s′) da,

for all s, s′ ∈ S. A Markov transition function pπ(s, s′) also defines the operator Pπᵀ
that acts on the probability density functions:

(Pπᵀ ν)(s′) = ∫_S pπ(s, s′) ν(s) ds,    (3.2.2)
where ν is a bounded probability density function defined on S. Suppose that ν is the
distribution of the agent’s current state. Then PTπ ν is the distribution of the agent’s
next state after the agent takes an action according to π.
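On a finite state space, (3.2.2) is again a simple contraction. The helper below is a hypothetical illustration of our own, not code from the chapter: it computes the next-state distribution Pπᵀν for a tabular MDP.

```python
import numpy as np

def next_state_dist(P, pi, nu):
    """Apply (3.2.2) on a finite state space: given the transition
    tensor P[s, a, s'] = p(s, a, s'), policy pi[s, a], and current
    state distribution nu[s], return (P_pi^T nu)(s')."""
    # p_pi(s, s') = sum_a pi(s, a) p(s, a, s')
    P_pi = np.einsum('sa,sat->st', pi, P)
    # (P_pi^T nu)(s') = sum_s p_pi(s, s') nu(s)
    return nu @ P_pi
```

Because each row pπ(s, ·) is a probability density, the output of `next_state_dist` is again a probability distribution whenever `nu` is one.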
3.2.1 Infinite-Horizon Average-Reward MDP
Denote by Π the space of all randomized stationary policies. The policy optimization
problem is to maximize the infinite-horizon average reward over Π:
v̄∗ = max_{π∈Π} { v̄π = lim sup_{T→∞} Eπ[ (1/T) ∑_{t=0}^{T−1} r(sₜ, aₜ) | s₀ ∼ p₀ ] },    (3.2.3)
where (s0, a0, s1, a1, . . . , st, at, . . .) are state-action transitions generated by the
Markov decision process under π from the initial distribution p0, and the expectation
Eπ[·] is taken over the entire trajectory.
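The average reward in (3.2.3) can be approximated empirically. The following sketch, a toy finite-MDP simulator of our own construction, estimates the per-step average reward of a fixed randomized stationary policy over a finite horizon T.

```python
import numpy as np

def average_reward(P, R, pi, s0, T, rng):
    """Monte Carlo estimate of (1/T) * sum_{t<T} r(s_t, a_t) for a
    finite MDP with transition tensor P[s, a, s'] and reward table
    R[s, a], run under stationary policy pi from initial state s0."""
    s, total = s0, 0.0
    for _ in range(T):
        a = rng.choice(R.shape[1], p=pi[s])    # a_t ~ pi(s_t, ·)
        total += R[s, a]
        s = rng.choice(P.shape[2], p=P[s, a])  # s_{t+1} ~ p(s_t, a_t, ·)
    return total / T
```

For large T this trajectory average approaches v̄π under the ergodicity assumptions discussed next; the limit superior in (3.2.3) handles the general case where the limit need not exist.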
Under certain continuity and ergodicity assumptions [90, 56], the optimal average
reward v̄∗ is independent of the initial distribution p0, and the policy optimization
problem (3.2.3) is equivalent to the following optimization problem
v̄∗ = max_{π∈Π, ξ:S→R} ∫_S ξ(s) ∫_A π(s, a) r(s, a) da ds

s.t.  ξ(s′) = ∫_S pπ(s, s′) ξ(s) ds, ∀s′,

      ξ ≥ 0, ∫_S ξ(s) ds = 1,    (3.2.4)
where the constraint ensures that ξ is the stationary density function of states under
the policy π. Let µ(s, a) = ξ(s)π(s, a). Then policy optimization problem (3.2.4)
becomes a (possibly infinite dimensional) linear program
v̄∗ = max_{µ:S×A→R} ∫_S ∫_A µ(s, a) r(s, a) da ds

s.t.  ∫_A µ(s′, a) da = ∫_S ∫_A p(s, a, s′) µ(s, a) da ds, ∀s′,

      µ ≥ 0, ∫_S ∫_A µ(s, a) da ds = 1,    (3.2.5)
where the constraint ensures that µ is a stationary state-action density of the MDP
under some policy. We denote by µ∗ the optim