
Reinforcement Learning: An Introduction

Second edition, in progress

Richard S. Sutton and Andrew G. Barto

A Bradford Book

The MIT Press
Cambridge, Massachusetts

London, England


In memory of A. Harry Klopf

Contents

Preface
Series Foreword
Summary of Notation

I The Problem

1 Introduction
1.1 Reinforcement Learning
1.2 Examples
1.3 Elements of Reinforcement Learning
1.4 An Extended Example: Tic-Tac-Toe
1.5 Summary
1.6 History of Reinforcement Learning
1.7 Bibliographical Remarks

2 Bandit Problems
2.1 An n-Armed Bandit Problem
2.2 Action-Value Methods
2.3 Softmax Action Selection
2.4 Incremental Implementation
2.5 Tracking a Nonstationary Problem
2.6 Optimistic Initial Values
2.7 Associative Search (Contextual Bandits)
2.8 Conclusions
2.9 Bibliographical and Historical Remarks

3 The Reinforcement Learning Problem
3.1 The Agent-Environment Interface
3.2 Goals and Rewards
3.3 Returns
3.4 Unified Notation for Episodic and Continuing Tasks
3.5 The Markov Property
3.6 Markov Decision Processes
3.7 Value Functions
3.8 Optimal Value Functions
3.9 Optimality and Approximation
3.10 Summary
3.11 Bibliographical and Historical Remarks

II Tabular Action-Value Methods

4 Dynamic Programming
4.1 Policy Evaluation
4.2 Policy Improvement
4.3 Policy Iteration
4.4 Value Iteration
4.5 Asynchronous Dynamic Programming
4.6 Generalized Policy Iteration
4.7 Efficiency of Dynamic Programming
4.8 Summary
4.9 Bibliographical and Historical Remarks

5 Monte Carlo Methods
5.1 Monte Carlo Policy Evaluation
5.2 Monte Carlo Estimation of Action Values
5.3 Monte Carlo Control
5.4 On-Policy Monte Carlo Control
5.5 Evaluating One Policy While Following Another (Off-policy Policy Evaluation)
5.6 Off-Policy Monte Carlo Control
5.7 Incremental Implementation
5.8 Summary
5.9 Bibliographical and Historical Remarks

6 Temporal-Difference Learning
6.1 TD Prediction
6.2 Advantages of TD Prediction Methods
6.3 Optimality of TD(0)
6.4 Sarsa: On-Policy TD Control
6.5 Q-Learning: Off-Policy TD Control
6.6 Games, Afterstates, and Other Special Cases
6.7 Summary
6.8 Bibliographical and Historical Remarks

7 Eligibility Traces
7.1 n-Step TD Prediction
7.2 The Forward View of TD(λ)
7.3 The Backward View of TD(λ)
7.4 Equivalence of Forward and Backward Views
7.5 Sarsa(λ)
7.6 Q(λ)
7.7 Replacing Traces
7.8 Implementation Issues
7.9 Variable λ
7.10 Conclusions
7.11 Bibliographical and Historical Remarks

8 Planning and Learning with Tabular Methods
8.1 Models and Planning
8.2 Integrating Planning, Acting, and Learning
8.3 When the Model Is Wrong
8.4 Prioritized Sweeping
8.5 Full vs. Sample Backups
8.6 Trajectory Sampling
8.7 Heuristic Search
8.8 Summary
8.9 Bibliographical and Historical Remarks

III Approximate Solution Methods

9 On-policy Approximation of Action Values
9.1 Value Prediction with Function Approximation
9.2 Gradient-Descent Methods
9.3 Linear Methods
9.4 Control with Function Approximation
9.5 Should We Bootstrap?
9.6 Summary
9.7 Bibliographical and Historical Remarks

10 Off-policy Approximation of Action Values

11 Policy Approximation

12 State Estimation

13 Temporal Abstraction

IV Frontiers

14 Biological Reinforcement Learning

15 Applications and Case Studies
15.1 TD-Gammon
15.2 Samuel's Checkers Player
15.3 The Acrobot
15.4 Elevator Dispatching
15.5 Dynamic Channel Allocation
15.6 Job-Shop Scheduling

16 Prospects
16.1 The Unified View
16.2 Other Frontier Dimensions

References

Index


Preface

We first came to focus on what is now known as reinforcement learning in late 1979. We were both at the University of Massachusetts, working on one of the earliest projects to revive the idea that networks of neuronlike adaptive elements might prove to be a promising approach to artificial adaptive intelligence. The project explored the "heterostatic theory of adaptive systems" developed by A. Harry Klopf. Harry's work was a rich source of ideas, and we were permitted to explore them critically and compare them with the long history of prior work in adaptive systems. Our task became one of teasing the ideas apart and understanding their relationships and relative importance. This continues today, but in 1979 we came to realize that perhaps the simplest of the ideas, which had long been taken for granted, had received surprisingly little attention from a computational perspective. This was simply the idea of a learning system that wants something, that adapts its behavior in order to maximize a special signal from its environment. This was the idea of a "hedonistic" learning system, or, as we would say now, the idea of reinforcement learning.

Like others, we had a sense that reinforcement learning had been thoroughly explored in the early days of cybernetics and artificial intelligence.