Discovering Hierarchy in
Reinforcement Learning
Bernhard Hengst
PhD Thesis
School of Computer Science and Engineering
University of New South Wales
Australia
December 2003
© 2003 Bernhard Hengst
Declaration
I hereby declare that this submission is my own work and to the best
of my knowledge it contains no material previously published or writ-
ten by another person, nor material which to a substantial extent has
been accepted for the award of any other degree or diploma at UNSW
or any other educational institution, except where due acknowledgement
is made in the thesis. Any contribution made to the research by others,
with whom I have worked at UNSW or elsewhere, is explicitly acknowl-
edged in the thesis.
I also declare that the intellectual content of this thesis is the product
of my own work, except to the extent that assistance from others in
the project's design and conception or in style, presentation and
linguistic expression is acknowledged.
Abstract
This thesis addresses the open problem of automatically discovering hierarchical
structure in reinforcement learning.
Current algorithms for reinforcement learning fail to scale as problems become
more complex. Many complex environments empirically exhibit hierarchy and can
be modelled as interrelated subsystems, each in turn with hierarchic structure. Sub-
systems are often repetitive in time and space, meaning that they reoccur as com-
ponents of different tasks or occur multiple times in different circumstances in the
environment. A learning agent may sometimes scale to larger problems if it success-
fully exploits this repetition. Evidence suggests that a bottom-up approach, which
repetitively finds building-blocks at one level of abstraction and uses them as back-
ground knowledge at the next level of abstraction, makes learning in many complex
environments tractable.
An algorithm, called HEXQ, is described that automatically decomposes and
solves a multi-dimensional Markov decision problem (MDP) by constructing a multi-
level hierarchy of interlinked subtasks without being given the model beforehand.
The effectiveness and efficiency of the HEXQ decomposition depends largely on the
choice of representation in terms of the variables, their temporal relationship and
whether the problem exhibits a type of constrained stochasticity.
The algorithm is first developed for stochastic shortest path problems and then
extended to infinite horizon problems. The operation of the algorithm is demon-
strated using a number of examples including a taxi domain, various navigation
tasks, the Towers of Hanoi and a larger sporting problem.
The main contributions of the thesis are the automation of (1) decomposition,
(2) sub-goal identification, and (3) discovery of hierarchical structure for MDPs with
states described by a number of variables or features. It points the way to further
scaling opportunities that encompass approximations, partial observability, selective
perception, relational representations and planning. The longer term research aim
is to train rather than program intelligent agents.
Acknowledgements
I would like to thank Professor Claude Sammut, my supervisor, for his support
over the years. He identified “scaling” as an important issue facing reinforcement
learning, which directly led to this thesis topic. My understanding of research has
benefited from our numerous conversations and his critical insights. He has sup-
ported my attendance at ICML conferences and my participation in the RoboCup
Sony legged league for four years.
I have enjoyed tutoring the Introduction to AI course for two years and guest
lecturing in Machine Learning. I thank my co-supervisor Achim Hoffmann, Claude
Sammut and Mike Bain for these opportunities.
Donald Michie was a long term visitor at the school on several occasions. I
thank him for his initial encouragement and later for the stimulating discussions on
structured induction over extended lunches and via long e-mails.
I have found the many researchers in machine learning, that I have had reason
to contact personally, to always be responsive and willing to offer assistance, despite
their onerous time commitments. They include Chuck Anderson, David Andre,
Andy Barto, Tom Dietterich, Milos Hauskrecht, Richard Korf, Sridhar Mahade-
van, Shie Mannor, Tom Mitchell, Andrew Moore, Ron Parr, Balaraman Ravindran
(Ravi), Sebastian Thrun, Paul Utgoff, Chris Watkins and Rich Sutton.
I recall “discovering” reinforcement learning in 1998 and purchasing Rich Sutton
and Andy Barto’s introductory book on the subject. Rich Sutton did not want the
answers to the exercises widely distributed, but promised to email them to me, a
chapter at a time, if I first sent him my attempted answers. This discipline helped
develop an understanding of the subject which served me well in the intervening
years.
I value the friends and associations made with other research students, past and
present, in the department of artificial intelligence: Rex Kwok, Michael Harries,
Waleed Kadous, Mark Peters, Jane Brennan, JP Bekmann, Phil Preston, Barry
Drake, Duncan Potts, Andy Isaacs, Malcolm Ryan, Mark Reid, Cameron Stone,
James Westendorp and Solly Brown.
On our first meeting, Barry Drake had to explain the meaning of an aliased
state to me. I thank Barry for his assistance, the long philosophic discussion over
coffee, the organisation of the special interest group meetings on spatial topics (SIG
Spatial) and the many games of chess.
Duncan Potts paid me the compliment of reproducing the HEXQ algorithm and
results, based only on the cryptic description in the 2002 ICML paper. I discovered
this one day on his web site, quite by accident. He has since made his own contri-
bution to HEXQ, speeding up the algorithm. I thank him for the subsequent useful
discussions.
I am grateful to Phil Preston for his early and continuing help in the lab with
everything from manufacturing video cables to directing me to security facilities. He
has always been willing to give of his time.
The School’s support staff deserve special recognition. Without them the many
details of room bookings, shipments, travel and function arrangements would not
have proceeded as smoothly. In particular I would like to thank Les Sharpley,
Mariann Davies, Tanya Oshuiko, Sue Lewis, Ann Baker, Magda Chambers and
Brad Hall.
My participation in the RoboCup Sony legged league soccer competition has
been a significant research motivator. It was made all the more rewarding by the
many associations with the staff and students involved in this project over the years.
They include: Son Pham, Darren Ibbotson, John Dalgliesh, Mike Lawther, Tak
Fai Yik, Martin Siu, Spencer Chen, Tom Vogelgesang, Ken Nguyen, Hao Nguyen,
Andres Olave, David Wang, James Wong, Nik Von Huben, James Brooks, Tim
Tam, Min Sub Kim, Alan Tay, Benjamin Leung, Albert Chang, Ricky Chen, Eric
Chung, Ross Edwards, Eileen Mak, Raymond Sheh, Nic Sutanto, Terry Tam, Alex
Tang, Nathan Wong, Brad Hall, Claude Sammut, Alan Blair, Maurice Pagnucco,
Will Uther and Tatjana Zrimec. This league would not of course have been possible
without Sony. Masahiro Fujita, Principal Scientist and System Architect and his
staff at the Intelligent Dynamics Laboratory have been most helpful and supportive.
I would like to thank the following people for reading parts or the whole of this
thesis prior to submission and providing me with valuable feedback: Duncan Potts,
Eduardo Morales, Barry Drake, Achim Hoffmann, Waleed Kadous, Coral Hengst,
Alan Blair, Solly Brown and Claude Sammut. If the thesis is any more readable it
is due to their helpful suggestions.
A special thanks to my son, Kyle Hengst, who animated the optimum behaviour
policy found for the ball kicking domain. I have exploited the visual impact of these
animations on a number of occasions.
Finally, I would like to thank my sons Kyle and Shane and particularly my wife,
Coral, for their love and support during this midlife change in career direction.
This thesis is dedicated to the memory of my grandfather,
Joseph Zenker.
Contents
1 Introduction 1
1.1 Scope and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 HEXQ Automatic Decomposition . . . . . . . . . . . . . . . . . . . . 7
1.4 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Preliminaries 13
2.1 The Markov Property . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Markov Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Semi-Markov Decision Problems . . . . . . . . . . . . . . . . . . . . . 19
3 Background 22
3.1 Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Hierarchical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Learning Hierarchical Models in Stages . . . . . . . . . . . . . . . . . 25
3.4 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . 27
3.4.1 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 HAMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.3 MAXQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.4 Optimality of Hierarchical Structures . . . . . . . . . . . . . . 36
3.5 Learning Hierarchies: The Open Question . . . . . . . . . . . . . . . 38
3.5.1 Bottleneck and Landmark States . . . . . . . . . . . . . . . . 39
3.5.2 Common Sub-spaces . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.3 Multi-dimensional States . . . . . . . . . . . . . . . . . . . . . 40
3.5.4 Markov Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.5 Other Approaches to Discovering Hierarchy . . . . . . . . . . 43
3.6 Conclusions and Motivation for HEXQ . . . . . . . . . . . . . . . . . 44
4 HEXQ Decomposition 46
4.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 HEXQ Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Partitioning MDPs of Dimension Two . . . . . . . . . . . . . 49
4.2.2 Sub-MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.3 Top level semi-MDP . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.4 Higher Dimensional MDPs . . . . . . . . . . . . . . . . . . . . 58
4.2.5 Value Function Decomposition with HEXQ Trees . . . . . . . 58
4.3 Optimality of HEXQ Trees . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.1 Globally Optimal HEXQ . . . . . . . . . . . . . . . . . . . . . 66
4.4 Representing HEXQ trees compactly . . . . . . . . . . . . . . . . . . 69
4.4.1 Markov Equivalent Regions (MERs) . . . . . . . . . . . . . . 69
4.4.2 State Abstracting Markov Equivalent Regions . . . . . . . . . 71
4.4.3 Compaction of Higher Dimensional MDPs . . . . . . . . . . . 72
5 Automatic Decomposition: The HEXQ Algorithm 76
5.1 Variable ordering heuristics . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Finding Markov Equivalent Regions . . . . . . . . . . . . . . . . . . . 82
5.2.1 Discovering Exits . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.2 Forming Regions . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3 Creating and Solving Region Sub-MDPs . . . . . . . . . . . . . . . . 92
5.4 Hierarchical State Value . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.5 State and Action Abstraction . . . . . . . . . . . . . . . . . . . . . . 95
5.6 HEXQ Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.7 Efficiency Improvements . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.7.1 No-effect Actions . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.7.2 Combining Levels . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6 Empirical Evaluation 102
6.1 The Taxi Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.1.1 Automatic Decomposition of the Taxi Domain . . . . . . . . . 104
6.1.2 Taxi Performance . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.1.3 Taxi with Four State Variables . . . . . . . . . . . . . . . . . 116
6.1.4 Fickle Taxi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.5 Taxi with Fuel . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2 Twenty Five Rooms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.3 Towers of Hanoi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.4 Ball Kicking - a larger MDP . . . . . . . . . . . . . . . . . . . . . . . 137
6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7 Decomposing Infinite Horizon MDPs 146
7.1 The Decomposed Discount Function . . . . . . . . . . . . . . . . . . . 149
7.2 Infinite Horizon MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.3 Infinite Horizon Experiments . . . . . . . . . . . . . . . . . . . . . . . 156
7.3.1 Continuing Taxi . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.3.2 Ball Kicking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8 Approximations 163
8.1 Variable Resolution Approximations . . . . . . . . . . . . . . . . . . . 164
8.1.1 Hierarchies of Abstract Models . . . . . . . . . . . . . . . . . 165
8.1.2 Variable Resolution Value Function . . . . . . . . . . . . . . . 167
8.1.3 Variable Resolution Exit States . . . . . . . . . . . . . . . . . 169
8.1.4 Variable Resolution Exits . . . . . . . . . . . . . . . . . . . . 170
8.2 Stochastic Approximations . . . . . . . . . . . . . . . . . . . . . . . . 173
8.3 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 177
9 Future Research 178
9.1 Discovering Exits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.2 Stochasticity at Region Boundaries . . . . . . . . . . . . . . . . . . . 180
9.3 Multiple Simultaneous Regions . . . . . . . . . . . . . . . . . . . . . 181
9.4 Multi-dimensional Actions . . . . . . . . . . . . . . . . . . . . . . . . 183
9.5 Default Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
9.6 Dynamic Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
9.7 Selective Perception and Hidden State . . . . . . . . . . . . . . . . . 186
9.8 Deictic Representations . . . . . . . . . . . . . . . . . . . . . . . . . . 189
9.9 Quantitative Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9.10 Average Reward HEXQ . . . . . . . . . . . . . . . . . . . . . . . . . 192
9.11 Deliberation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
9.12 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.13 Scaling to Greater Heights . . . . . . . . . . . . . . . . . . . . . . . . 194
9.14 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10 Conclusion 197
10.1 Summary of Main Ideas . . . . . . . . . . . . . . . . . . . . . . . . . 198
10.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . 199
10.3 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
BIBLIOGRAPHY 201
APPENDICES 210
A Mathematical Review 211
A.1 Partitions and Equivalence Relations . . . . . . . . . . . . . . . . . . 211
A.2 Directed Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
A.3 Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
B Significance Values in Graphs 216
B.1 Error Bars and Confidence Intervals . . . . . . . . . . . . . . . . . . . 216
B.2 Taxi Learning Rate and Convergence . . . . . . . . . . . . . . . . . . 218
List of Figures 220
List of Tables 227
Chapter 1
Introduction
The objective of this thesis is to study an approach to the discovery of hierarchical
structure in reinforcement learning. The key idea is to automatically find invari-
ant reusable sub-tasks and abstract them to form a reduced model of the original
problem that requires significantly less space and time for its solution.
Reinforcement learning scales poorly because the state space description of a
problem grows exponentially with the number of variables. Bellman (1961) referred
to this as the “curse of dimensionality” adding that sheer enumeration will not solve
problems of any significance.
Many large problems have some structure that allows them to be broken down
into sub-problems and represented more compactly. The sub-problems, being smaller,
are often solved more easily. The solutions to the sub-problems may be combined to
provide the solution for the original larger problem. This decomposition may make
finding the final solution significantly more efficient. Designers usually decompose
problems manually, however, automating the decomposition of a problem appears to
be more difficult. It is desirable to have machines with this ability to free designers
from this task and to allow the machines to adapt to new and unforeseen situations.
1.1 Scope and Assumptions
The scope of this thesis is limited to a particular heuristic algorithm for the auto-
matic hierarchical decomposition and solution of finite Markov decision problems
(MDPs)1. The algorithm is called HEXQ.
The states of the MDP are assumed to be multi-dimensional, meaning, that they
are defined by a number of variables that describe the features of the problem. A
robot’s location, for example, may be given by two variables, the room it occupies
and its position in the room. The variables can be interpreted as a robot’s sensor
readings that provide information about its environment.
A model of the environment describes the probabilistic state transition and re-
ward received by the agent after taking an action. It is assumed that the learning
agent is not in possession of a model of its environment at the beginning, but has
to discover this for itself during learning.
The decomposition heuristic employed by HEXQ proceeds on a variable by vari-
able basis. For many problems it is possible for an agent to learn a partial model
over a subset of the variables that is invariant in all contexts represented by the
rest of the variables. Reusable policies may be learnt over the regions of the state
space defined by the partial model. The original problem is reduced and solved by
factoring out the region details. For example, a robot may learn to navigate inside
rooms and out of doorways. The detailed skills of intra-room navigation can be en-
capsulated as abstract actions and whole rooms considered as the states of a smaller
abstract MDP. In this smaller problem the focus is on learning the best policy to
traverse rooms, to say leave a building. If, in addition, the detailed room leaving
skills can be reused, the overall savings in learning time and policy storage can be
significant.
No assumption needs to be made about the interdependence of the variables
1An MDP is formally described in Chapter 2.
describing the MDP, but an MDP will only decompose in a useful way, given some
constraints in the problem in addition to the Markov property2. An informal char-
acterisation of the constraints required for a useful HEXQ decomposition is that
• the state space is described by a set of variables
• some variables must change on a slower time scale than others,
• the more frequently changing variables should retain the same values to rep-
resent similar features in the context of the slower changing variables, and
• policies can be learnt over the more frequently changing variables to control
the way sub-space regions can be traversed.
The designer is required to ensure that the problem is specified as a Markov
decision problem. The designer is not required to know how the problem can de-
compose, if at all. However, the definition of the variables can make a significant
difference to the efficient decomposition of a problem. If a designer is in a posi-
tion to influence the choice of variables, then an understanding of the operation of
HEXQ may allow a more judicious selection of these variables to obtain a better
decomposition. Section 4.2.1, for example, illustrates different decompositions when
a robot’s position is described using a coordinate system and a local representation.
The quality of the HEXQ decomposition, and HEXQ’s ability to generalise, will
depend on whether any structure in the MDP has been captured by the variables
in some meaningful way.
1.2 A Simple Example
A simple example will help make the learning problem concrete and illustrate the
basic concepts. This example has been chosen because the decomposition is easy to
2The Markov property is defined in Chapter 2.
visualise3.
Figure 1.1: A maze showing three rooms interconnected via doorways. Each room has 25 positions. The aim of the robot is to reach the goal.
Figure 1.1 shows a maze with three rooms interconnected via doorways. Each
room has 25 positions labelled in the same manner for each room. The robot’s
objective is to learn how to quickly leave the maze via either of the two exits to the
goal. The robot has two sensors. One to tell it which room it is in and one that
measures its position in a room. The robot is started at a random location in the
maze. The actions available are to hop deterministically to an adjacent cell in one
of the four compass directions: North, South East or West. If the robot hops into
a barrier it remains where it is, shaken but uninjured. Every action costs one unit,
except the hop to the goal, which is free. On each hop, the robot’s sensors provide
its location inside the maze, that is, state=(room, position-in-room).
A reinforcement learner can explore this problem and learn an optimal value
function (distance to the goal) by backing up the values from each goal, adding one
unit of cost for each additional step to reach the goal. The resulting optimal value
function is shown in figure 1.2. To reach the goal in the quickest way from any
location, it is a simple matter of continually hopping to the next lowest value cell.
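
As an aside, the backup computation just described is easy to express in code. The sketch below is illustrative only: the states, the successor function and the goal set are placeholders for whatever encoding of the maze is chosen, not a representation assumed elsewhere in this thesis. It propagates cost one step at a time outward from the goals, exactly the backup that produces the values in figure 1.2.

from collections import deque

def cost_to_go(states, successors, goals):
    # Backward breadth-first backup for a deterministic, unit-cost problem:
    # the optimal value of a state is its step distance to the nearest goal.
    predecessors = {s: set() for s in states}
    for s in states:
        for s2 in successors(s):
            predecessors[s2].add(s)
    value = {g: 0 for g in goals}              # reaching the goal costs nothing further
    frontier = deque(goals)
    while frontier:
        s = frontier.popleft()
        for p in predecessors[s]:
            if p not in value:                 # first visit gives the shortest distance
                value[p] = value[s] + 1        # one unit of cost per hop
                frontier.append(p)
    return value

Acting greedily with respect to the returned table, that is, always hopping to a neighbouring cell of lower value, reproduces the behaviour described above.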
The three rooms have identical internal transition and reward functions, yet a
3The very simplicity of this example will not allow it to be used to highlight the scaling potential of the approach.
Figure 1.2: The maze showing the cost to reach the goal from any location.
reinforcement learner explores each room without reference to any of the others.
To decompose the problem automatically the robot could explore the behaviour
of each sensor variable separately. It can determine that many transitions for the
position-in-room variable are able to be modelled reliably without reference to the
room value. By focussing on the position-in-room variable, the only positions from
which results are unpredictable are where the robot exits a room. The robot there-
fore considers the sub-problem of how to leave each room via one of its exits.
Figure 1.3: The maze, decomposed into rooms, showing the number of steps to reach each room exit on the way to the goal.
It proceeds to learn separate value functions over just one room, in the same
way the value function was found previously for the whole maze. The room value
functions can be reused. For example, exiting the bottom left room to the East
produces the same room values for each room position as exiting the top left room
to the east. Similarly, exiting the top right room to the South produces the same
values as exiting the top left room to the South. The position values for each of
these room exits are shown in figure 1.3.
Figure 1.4: The maze, abstracted to a reduced model with only three states, one for each room. The arrows indicate transitions for the abstract actions that are the policies for leaving a room to the North, South, East and West.
Having learnt room leaving policies, the original problem can be abstracted and
reduced to a simpler model with just three states, one for each room, as illustrated
in figure 1.4. The key property that makes this abstraction possible is that the value
to reach the goal after exiting each room is independent from the value inside each
room to reach the room exit.
For the robot to decide the best way to act in any location in the maze it must
compose the total value function by adding the cost components of the journey back
from the goal. For example, if it started in the very top left location there are two
different inter-room routes to reach the goal. In each case, the cost to reach the goal
after exiting the top left room is 4. Adding 1 unit of cost to hop into either of the
next two rooms and another 6 to reach a top left room exit, makes the total cost
11, just as it was previously. If instead the starting location is in the top right hand
corner of the top left room, there are still two direct paths to the goal via different
rooms. By adding up the costs this time, one will take 11 units and the other only
7 units. In this way the robot can choose the shortest path, in this case, via the top
right room.
This example has demonstrated how policies and regions for sub-problems can
be abstracted and reused to solve the original problem. How can the decomposition
and solution process be automated?
1.3 HEXQ Automatic Decomposition
This section provides a broad overview of the HEXQ algorithm. The details will
become clearer in Chapters 4 and 5.
HEXQ approaches the automatic decomposition of MDPs by a variable-wise
search for Markov sub-space regions. The decomposition depends largely on the
choice of variables and the structure of the problem. Since variables that are asso-
ciated with the lower levels in a control hierarchy tend to change more frequently,
the variables are first sorted by their frequency of change using a purely random
exploration policy. During a random walk in the maze example, the room variable
will change less frequently than the position-in-room variable.
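
As a rough illustration of this ordering heuristic, and not the implementation used by HEXQ, the following sketch counts how often each state variable changes value along a trajectory gathered under a random exploration policy; the example trajectory and resulting order are hypothetical.

def change_frequencies(trajectory):
    # Count, for each state variable, how often its value changes between
    # consecutive observations of a random-walk trajectory of state tuples.
    num_vars = len(trajectory[0])
    changes = [0] * num_vars
    for prev, curr in zip(trajectory, trajectory[1:]):
        for i in range(num_vars):
            if prev[i] != curr[i]:
                changes[i] += 1
    return changes

# Hypothetical observations of (room, position-in-room) during a random walk.
trajectory = [(0, 12), (0, 13), (0, 14), (1, 10), (1, 11), (1, 12)]
frequencies = change_frequencies(trajectory)
# Order the variables from most to least frequently changing; HEXQ builds
# its hierarchy from the most frequently changing variable upwards.
order = sorted(range(len(frequencies)), key=lambda i: frequencies[i], reverse=True)
print(order)    # [1, 0]: position-in-room changes more often than room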
HEXQ first explores the behaviour of the most frequently changing variable. It
looks for state transitions for this variable (and rewards) that are probabilistically
predictable without any of the other variables changing value. The states of the
variable (its values) may be partitioned into regions that have internal Markov
transition and reward properties invariant of the values of the other variables. Such
regions in the maze in figure 1.1 are the rooms.
HEXQ flags all transitions where either other variables change value or the tran-
sitions (or rewards) are unpredictable4. These transitions are called exits, because,
when executed, the agent leaves a region5. It is a condition of a HEXQ decom-
position that all exits must be reachable from inside a region with probability one
without being forced to leave the region elsewhere. In the maze example, the exits
are the transitions through the doorways.
4 i.e. non-stationary.
5 A region exit may leave a region and transition back to the same region.
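
A loose sketch of this exit-flagging idea follows. It is a simplification for illustration rather than the actual HEXQ test (which is statistical and is described in Chapter 5): experience is assumed to arrive as (state, action, next state) tuples, and a (value, action) pair of the most frequently changing variable is flagged as an exit when another variable changes, or when the projected outcome differs between contexts.

from collections import defaultdict

def find_exits(transitions, var=0):
    # transitions: list of (state, action, next_state), states as tuples.
    # outcomes[(value, action)][context] = set of observed next values of `var`.
    outcomes = defaultdict(lambda: defaultdict(set))
    exits = set()
    for s, a, s2 in transitions:
        value = s[var]
        context = s[:var] + s[var + 1:]
        if (s2[:var] + s2[var + 1:]) != context:
            exits.add((value, a))          # another variable changed: leaving the region
        outcomes[(value, a)][context].add(s2[var])
    for (value, a), by_context in outcomes.items():
        # Outcome depends on the context, so the projected transition is unpredictable.
        if len({frozenset(next_vals) for next_vals in by_context.values()}) > 1:
            exits.add((value, a))
    return exits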
HEXQ constructs multiple sub-MDPs for a region, each with the sub-goal of
leaving the region via one of its exits. The different policies to leave each region are
cached.
A new abstracted MDP is formed at the next level. Its abstract states are
generated by taking the Cartesian product of the region identifiers with the values
of the next most frequently changing variable. Its abstract actions are the cached
policies for each region. This MDP is a semi-MDP because the abstract actions
usually operate over an extended period of time until they terminate with a region
exit. For the maze, this abstract MDP is shown in figure 1.4.
The above procedure can be repeated until only one variable remains defining a
top-level sub-MDP. Solving this sub-MDP solves the original MDP.
HEXQ uses a hierarchically decomposed value function. The value function for
any state is composed of the reward accumulated inside each region to reach an exit
sub-goal, plus the value to continue, represented at more abstract levels.
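
Informally, and only anticipating the precise recursive equations developed in Chapter 4, the decomposition can be pictured as splitting the value of a state between the level that models the region and the level above it; the notation below is suggestive rather than exact:

\[
V(s) \;\approx\; \underbrace{V^{e}_{\mathrm{region}}(s)}_{\text{reward inside the region to reach exit } e}
\;+\; \underbrace{V_{\mathrm{above}}(e)}_{\text{value of continuing after the exit}}
\]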
Opportunities to reduce models and compactly represent value functions present
themselves when
• the policies of one region can be reused,
• a whole class of identical regions represents repetitive sub-spaces,
• region states can be abstracted, and
• policy and state abstractions can be employed at multiple levels.
1.4 Contribution
The contributions of this thesis are:
• An automatic decomposition algorithm for reinforcement learning problems.
The decomposition partitions the states into sub-problems that form smaller
reinforcement learning problems and are connected hierarchically to solve the
overall problem.
• A formulation of value function decomposition equations that generalises the Q
function to abstract MDPs. It allows automatic hierarchical credit assignment
as rewards that cannot be explained at lower levels are relegated to be modelled
at higher levels.
• A method for automatic discovery of sub-goals. Sub-goals are the region exit
states. They are generated naturally as part of the process of region discovery.
• A method for the abstraction of similar regions. Regions may be similar be-
cause they represent similar physical objects or the same object, in different
contexts or with different attributes. HEXQ automatically models similar re-
gions as one region class.
• A method for automatic and safe abstraction of both region policies and states.
A HEXQ decomposition guarantees safe abstraction of region states. The
algorithm is robust in the sense that, if the decomposition produces single
state regions, HEXQ effectively defaults to solving the “flat” problem.
• Proof of globally optimal solutions for HEXQ decomposed deterministic short-
est path problems and similar problems where only the top level sub-MDP is
stochastic.
• A method for safe state abstraction using a decomposed discounted value
function. This extends HEXQ to be able to tackle any finite MDP using a
discounted value function. The main innovation here is the introduction of
an additional and separate decomposed discount function working in concert
with the decomposed value function to safely and compactly represent state
values in a discounted setting.
• HEXQ extensions for automatic hierarchical decomposition of infinite horizon
problems (problems that do not terminate), where good solutions may require
the agent to continue execution in a sub-task.
• An introduction to variable resolution and stochastic approximation tech-
niques over and above safe state abstraction that reduce the computational
complexity further with a controllable trade-off in solution quality. The control
over decision time complexity may provide an anytime execution capability.
• Outline of other future research directions including the construction of more
general hierarchies and training instead of programming agents.
These results make a contribution to the open problem of discovering hierarchical
structure in reinforcement learning. The benefit is that the agent may decompose its
environment based on its own experience, thus relieving a designer from performing
this task. In the best case, hierarchical decomposition leads to space complexity
linear in the number of variables used to describe the sensor state. Empirical eval-
uation testifies to the versatility of HEXQ. In one case a problem is easily solved in
seconds that would otherwise require billions of table entries and is intractable on
present day computers.
1.5 Outline of thesis
Chapter 2 reviews some Markov and semi-Markov decision processes formalisms
that are used in other chapters. This Chapter is supplemented with other
mathematical concepts and algorithms in Appendix A. These concepts are
considered basic and are available through a number of sources. The chapter
can be skimmed for notation and basic assumptions that will help the reader
understand subsequent sections of the thesis.
Chapter 3 reviews background literature, examining:
• the benefit of a hierarchical approach to tackling complex problems,
• hierarchical reinforcement learning with manual decompositions,
• a number of approaches that automate decomposition and discover hier-
archical structure.
Chapter 4 focuses on the theory underlying the HEXQ algorithm. Importantly, it
formally defines the partition conditions used by HEXQ to decompose a multi-
dimensional MDP. The decomposition is explained using the simple maze ex-
ample introduced earlier. The chapter considers the issue of optimality and
shows how the decomposed value function can be represented compactly and
losslessly by abstracting both state and actions.
Chapter 5 describes the HEXQ algorithm, implementing in practice the theory
developed in Chapter 4. It explains how HEXQ explores its environment and
builds a reduced multi-level model of the original problem to find good policies.
Chapter 6 evaluates HEXQ empirically on a number of problems, illustrating its
characteristics. The problems include ones that other researchers have decom-
posed manually and some to test HEXQ in larger domains.
Chapter 7 introduces an additional supporting decomposed discount function to
allow safe state abstraction in the face of discounting. Abstraction of dis-
counted value functions allows the HEXQ algorithm to be extended to solve
infinite horizon problems in which the recursively optimal policy may require
a sub-task to persist.
Chapter 8 extends HEXQ by introducing approximations of the hierarchical value
function that further reduce computational complexity.
Chapter 9 addresses some of the limitations of HEXQ and suggests a number of
potentially fruitful research directions. These include improvements to the
existing algorithm and tackling the larger problem of learning in a complex
environment where the agent’s sensor state is large, yet does not fully describe
the environment.
Chapter 2
Preliminaries
This chapter introduces basic notation and definitions for Markov and semi-Markov
decision processes. The material is generally available in introductory texts and is
not meant to be comprehensive. The literature on Markov decision processes and
reinforcement learning is extensive. Introductory texts include Puterman (1994)
and Sutton and Barto (1998).
Figure 2.1: An agent interacting with its environment and receiving a reward signal.
The reinforcement learning framework comprises an agent that perceives the
environment (or domain) through sensors and acts on that environment through
effectors. The environment also produces a reward, a special numerical value that
the agent tries to maximise over time. The agent is taken to interact with its
environment at discrete time steps, t = 0, 1, 2, 3, . . .. At each time step it receives an
input representing some environmental state and a reward. It responds by taking
an action. The agent-environment interaction is depicted in figure 2.1.
2.1 The Markov Property
As an agent takes actions, $a$, and observes the states, $s$, of the environment, the history of states and actions up to time $t$ is $\ldots, s_{t-2}, a_{t-2}, s_{t-1}, a_{t-1}, s_t, a_t$. If the probability of the next state, $s'$, is only dependent on the last state and action then the state description is said to have the Markov property. This can be defined by specifying $\Pr\{s_{t+1} = s' \mid s_t, a_t\}$ for all $s'$, $s_t$ and $a_t$.

Rewards similarly have the Markov property, when the probability of the next reward value depends only on the last state and action: $\Pr\{r_{t+1} = r \mid s_t, a_t\}$.
2.2 Markov Decision Problems
Well established formal representations exist to describe an agent-environment interaction as a Markov decision problem; the notation here largely follows Sutton and Barto (1998). In this thesis, Markov decision problems are defined by a finite number of states $s_t \in S$ and a finite number of actions, $a_t \in A(s_t)$, that are available in each state. At discrete time steps, given action $a_t$, the system transitions from state $s_t$ to state $s_{t+1}$. A bounded real reward, $r_{t+1} \in \mathbb{R}$, is given to the agent at each time step. The state transition and reward functions are both assumed to have the Markov property. Further, they are both assumed to be stationary functions, meaning they are independent of time. Formally:

A discrete time, finite Markov Decision Process (MDP) is a tuple $\langle S, A, T, R, S_0 \rangle$ where

• $S$ is a finite set of states, $s \in S$.

• $A$ is a finite set of actions, $a \in A$, $A = \bigcup_{s \in S} A(s)$.

• $T$ is the one step probability function of transitioning from one state to the next when taking an action.

$$T^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \tag{2.1}$$

This one step transition probability is stationary, meaning that it is independent of time and is written more succinctly as $\Pr\{s' \mid s, a\}$.

• $R$ is a bounded reward function giving the expected reward on transition from state $s$ to the next state $s'$ after taking action $a$.

$$R^a_{ss'} = E\{r_{t+1} \mid s_{t+1} = s', s_t = s, a_t = a\} \tag{2.2}$$

The reward function is assumed to be stationary.

$$R^a_{ss'} = E\{r \mid s, a, s'\} = \sum_{r} r \cdot \Pr\{r \mid s, a, s'\} \tag{2.3}$$

• $S_0$ is the starting state distribution. This means that the MDP is initialized in state $s$ with probability $S_0(s)$.
In general MDPs can have an infinite number of states or actions. This thesis
considers only MDPs that have a finite number of states and actions.
A model in relation to an MDP refers to the state transition and reward functions.
When the state $s \in S$ is described by a vector of $d$ state variables, $s = (s_1, s_2, \ldots, s_i, \ldots, s_d)$, where $s_i$ is the $i$th state variable, the state $s$ is said to have a dimension of $d$. The associated MDP will be referred to as a $d$-dimensional MDP.
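
For a computational reading of the definition above, the tuple can be written down directly as a data structure. The sketch below is a minimal illustration whose names and layout are choices made here rather than notation from the thesis: states are tuples of variable values, and the transition and reward models are tables indexed by state and action.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = Tuple[int, ...]      # a d-dimensional state, e.g. (room, position-in-room)
Action = int

@dataclass
class FiniteMDP:
    states: List[State]
    actions: Dict[State, List[Action]]                         # A(s)
    # transition[(s, a)] maps each next state s2 to Pr{s2 | s, a}
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    # reward[(s, a, s2)] is the expected reward for that transition
    reward: Dict[Tuple[State, Action, State], float]
    start: Dict[State, float] = field(default_factory=dict)    # S0(s)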
An MDP policy, π, is a mapping from states to possible actions at each time step.
When this mapping is probabilistic, the notation π(s, a) is the stationary probability
of taking action a in state s. The notation π(s) refers to the action chosen according
to the probability distribution, that is π(s) = a. A deterministic policy is one such
that for all s, π(s, a) = 1 for exactly one a.
A deterministic policy is not to be confused with deterministic actions that ensure the next state is determined by the action, that is $T^a_{ss'} = 1$ or $0$. A deterministic policy may or may not involve deterministic actions. Deterministic rewards likewise mean that $\Pr\{r \mid s, a, s'\} = 1$ or $0$. Deterministic transitions mean that both actions and rewards are deterministic.
A Markov Decision Problem has an optimality criterion to maximise a value function, usually some measure of future reward. Value functions may be based on average reward, sum of rewards for a fixed number of time-steps, etc. Throughout this thesis the value function is the commonly used sum of (discounted) future rewards. In this case the value function for state $s$ in MDP $m$ with a policy, $\pi$, and a discount rate, $\gamma$, is given by:

$$V^\pi_m(s) = E\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots \mid s_t = s, \pi\} \tag{2.4}$$
Figure 2.2: Episodic MDP showing the transition on termination to a hypothetical absorbing state.
Episodic MDPs are ones that eventually terminate with probability one in con-
trast to infinite horizon MDPs that may not. To unify the value function definition
for both cases an episodic MDP is modelled by assuming it enters a hypothetical
absorbing state on termination. All transitions from the absorbing state lead back
to that state with probability 1 and reward 0 as illustrated in figure 2.2.
For infinite horizon MDPs the discount factor, γ, is in the range, 0 ≤ γ < 1.
For episodic problems γ can also be equal to 1. In this case the value function is
bounded as no further reward can accumulate after entering the absorbing state.
For an MDP $m$ with a fixed policy $\pi$ the value of state $s$ is "backed up" from the possible next states and can be written as the expected value of the next expected reward together with the discounted value of the next state. This is the Bellman equation:

$$V^\pi_m(s) = \sum_{s'} T^{\pi(s)}_{ss'}\left[R^{\pi(s)}_{ss'} + \gamma V^\pi_m(s')\right] \tag{2.5}$$
The optimal value function, $V^*_m$, maximizes the value function for all states $s \in S$ in MDP $m$ with respect to $\pi$. Bellman proved that this is the unique solution to the Bellman optimality equation:

$$V^*_m(s) = \max_a \sum_{s'} T^a_{ss'}\left[R^a_{ss'} + \gamma V^*_m(s')\right] \tag{2.6}$$

This equation is similar to equation 2.5, except this time, instead of finding the value of each state based on a given policy, the objective is to find a policy $\pi$, designated by $*$, that maximises the value of each state.
The action-value function $Q^\pi(s, a)$ is defined as the expected return starting in state $s$, taking action $a$ and following policy $\pi$ thereafter. $Q^\pi$ and the optimal $Q^*$ are defined in terms of $V^\pi$ and $V^*$ respectively:

$$Q^\pi_m(s, a) = E\{r_{t+1} + \gamma V^\pi_m(s_{t+1}) \mid s_t = s, a_t = a\} \tag{2.7}$$

$$Q^*_m(s, a) = E\{r_{t+1} + \gamma V^*_m(s_{t+1}) \mid s_t = s, a_t = a\} \tag{2.8}$$

and their respective Bellman equations are as follows:
Table 2.1: Action-Value Iteration

function ValueIteration( MDP⟨S, A, T, R⟩, γ )
1.  initialise Q(s, a) ← 0
2.  repeat until Δ < small positive number
3.      Δ ← 0
4.      for each state s ∈ S
5.          for each action a ∈ A
6.              q ← Q(s, a)
7.              Q(s, a) ← Σ_{s′} T^a_{ss′} [ R^a_{ss′} + γ V(s′) ]
8.              Δ ← max(Δ, |q − Q(s, a)|)
9.  end repeat
10. return Q(s, a)
end ValueIteration
$$Q^\pi_m(s, a) = \sum_{s'} T^a_{ss'}\left[R^a_{ss'} + \gamma Q^\pi_m(s', \pi(s'))\right] \tag{2.9}$$

$$Q^*_m(s, a) = \sum_{s'} T^a_{ss'}\left[R^a_{ss'} + \gamma \max_{a'} Q^*_m(s', a')\right] \tag{2.10}$$
Dynamic programming is the usual way to solve an MDP when the model (the state transition and reward functions) is known. Table 2.1 shows one of a class of algorithms that returns the optimal action-value function $Q^*(s, a)$. The optimal value function can be calculated as $V^*(s) = \max_a Q^*(s, a)$ and an optimal policy as $\pi^*(s) = \arg\max_a Q^*(s, a)$. This algorithm was adapted from similar algorithms in Sutton and Barto (1998). When the model is not known beforehand, it is not possible to use dynamic programming directly, but there are several other algorithms, such as Q-learning (Watkins and Dayan, 1992), that can both explore the state space and simultaneously learn optimal value functions. The Q-learning algorithm in table 2.2 was adapted from Sutton and Barto (1998).
Table 2.2: Q Learning

function Q-Learning
1.  initialise Q(s, a) ← 0 and set the learning rate β
2.  initialise s ← initial state
3.  repeat until termination
4.      choose action a using exploration policy derived from Q
5.      take action a, observe r, s′
6.      Q(s, a) ← (1 − β) Q(s, a) + β [ r + γ max_{a′} Q(s′, a′) ]
7.      s ← s′
end Q-Learning
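
Table 2.2 translates almost line for line into code. The sketch below is one possible tabular implementation; the environment interface (reset() and step(a) returning the next state, reward and a termination flag) is an assumption of this example rather than anything defined in the thesis, and an ε-greedy rule stands in for the "exploration policy derived from Q".

import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, beta=0.1, gamma=0.99, epsilon=0.1):
    # Tabular Q-learning, following Table 2.2.
    Q = defaultdict(float)                          # Q(s, a), initialised to 0
    for _ in range(episodes):
        s = env.reset()                             # initial state
        done = False
        while not done:                             # repeat until termination
            if random.random() < epsilon:           # exploration policy derived from Q
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s2, r, done = env.step(a)               # take action a, observe r, s'
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] = (1 - beta) * Q[(s, a)] + beta * target
            s = s2
    return Q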
2.3 Semi-Markov Decision Problems
The actions in the abstracted maze (figure 1.4) in Chapter 1 are room leaving poli-
cies. These actions usually take a number of time steps to complete. In general these
temporally extended or abstract actions will be seen to be important in hierarchical
reinforcement learning as more abstract descriptions of problems use actions that
consist of a whole sequence of more primitive actions.
MDPs generalise to semi-MDPs in which actions can persist over a number of time steps; this discrete formulation largely follows Dietterich (2000). Denoting the random variable $N$ to be the number of time steps that an abstract action $a$ takes to complete when it is executed starting in state $s$, the state transition and reward functions for a semi-MDP can be generalized. The joint probability distribution of result state $s'$ reached in $N$ time steps when action $a$ is executed in state $s$ is:

$$T^{Na}_{ss'} = \Pr\{s_{t+N} = s' \mid s_t = s, a_t = a\} \tag{2.11}$$

Similarly, the expected reward when state $s'$ is reached after $N$ time steps taking
action $a$ in state $s$ is:

$$R^{Na}_{ss'} = E\left\{\sum_{n=1}^{N} \gamma^{n-1} r_{t+n} \,\Big|\, s_t = s, a_t = a, s_{t+N} = s'\right\} \tag{2.12}$$
The Bellman equations for the value functions for an arbitrary policy and optimal policies are similar to those for MDPs with the sum taken with respect to $s'$ and $N$ using the joint probability distribution $T$:

$$V^\pi_m(s) = \sum_{s',N} T^{N\pi(s)}_{ss'}\left[R^{N\pi(s)}_{ss'} + \gamma^N V^\pi_m(s')\right] \tag{2.13}$$

$$V^*_m(s) = \max_a \sum_{s',N} T^{Na}_{ss'}\left[R^{Na}_{ss'} + \gamma^N V^*_m(s')\right] \tag{2.14}$$
For episodic semi-MDPs with the discount factor γ set to 1, the joint probability
distribution with respect to N and s′ can be taken just with respect to s′, the
state reached on termination of the abstract action a after any number of steps. In
this case the Bellman equations are similar to the ones for MDPs with the expected
primitive reward replaced with the expected sum of primitive rewards to termination
of the abstract action.
$$T^a_{ss'} = \Pr\{s' \mid s, a\} = \sum_{N=1}^{\infty} T^{Na}_{ss'} \tag{2.15}$$

$$R^a_{ss'} = E\{R^{Na}_{ss'} \mid s, a, s'\} = \sum_{N=1}^{\infty} \frac{T^{Na}_{ss'}\, R^{Na}_{ss'}}{T^a_{ss'}} \tag{2.16}$$

$$V^\pi_m(s) = \sum_{s'} T^{\pi(s)}_{ss'}\left[R^{\pi(s)}_{ss'} + V^\pi_m(s')\right] \tag{2.17}$$

$$V^*_m(s) = \max_a \sum_{s'} T^a_{ss'}\left[R^a_{ss'} + V^*_m(s')\right] \tag{2.18}$$
MDPs and their generalisation to semi-MDPs are the basis for much of the
background and related work covered in the next Chapter and the formalism to
describe HEXQ decomposition in later Chapters.
Chapter 3
Background
This chapter reviews literature to position and motivate the thesis.
Two methods for scaling up reinforcement learning algorithms are function ap-
proximation and hierarchical decomposition. As the name implies, function ap-
proximation is aimed at approximating and thereby compacting a value function.
Hierarchical approaches use structure in the representation to try to compact the
representation and have the potential, in the best case, to reduce the exponen-
tial growth in the size of the state space to linear in the number of variables. These
methods are not mutually exclusive and function approximation may be used within
hierarchical representations. After a brief introduction to function approximation,
the chapter will focus on hierarchical approaches. For a more general review of
reinforcement learning the survey by Kaelbling et al. (1996) is recommended.
3.1 Function Approximation
Function approximation is used to represent a value function compactly. A state
description is mapped to a value of the function. The mapping may be fixed by
the designer or parameterised, with the parameters updated during the learning
process. In the latter case, function approximation can be viewed as supervised
learning (Sutton and Barto, 1998).
There are many examples of function approximation in reinforcement learning.
Some early examples of function approximation include Samuel’s famous checker
player and Michie and Chambers (1968) BOXES algorithm1. For the checker player,
the board state value function may be approximated by a linear combination of
weighted features such as, for example, the size of the disk advantage. Anderson
(1986) used an artificial neural network to approximate the value function for a pole
balancer. Santamaria et al. (1998) experimented with various techniques such as tile
coding, instance and case based methods in continuous state spaces. In their book
Sutton and Barto (1998) include a good introduction to function approximation and
some case studies. While function approximation has been used successfully in many
cases, caution is required as convergence of the value function is not guaranteed
for all generalisations. Defining the class of functions and conditions to ensure
convergence is an active research topic.
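
For the linear case mentioned above, the idea can be stated in a few lines; the feature function and update rule here are a generic sketch rather than any particular system from the literature:

def linear_value(weights, features):
    # Approximate V(s) as a weighted sum of features describing state s.
    return sum(w * f for w, f in zip(weights, features))

def update_weights(weights, features, target, alpha=0.01):
    # One gradient step that moves the approximation towards a target value
    # (for example, a one-step backed-up estimate). Returns the new weights.
    error = target - linear_value(weights, features)
    return [w + alpha * error * f for w, f in zip(weights, features)]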
Model minimisation (Dean and Givan, 1997) can be interpreted as a type of func-
tion approximation. Approximation is a misnomer in this case, as the value function
is compacted without loss of accuracy. The states of an MDP are partitioned into
blocks such that the states in each block behave in the same way. States are re-
quired to have the same probability and reward function for transitions to other
blocks and hence all states in the one block have the same value. Dean and Givan
(1997) use Bayesian networks (Pearl, 1988) to implicitly encode variable dependen-
cies. Model minimisation is similar to the state-space abstraction by Boutilier and
Dearden (1994) and compacts a value function without loss. Ravindran and Barto
(2002) generalise model minimisation to symmetries in MDP models.
Model minimisation is implicitly applied to Markov sub-space regions as one
type of state abstraction in HEXQ, as will become apparent.
1 the origins of which date back to Michie's 1961 "MENACE" (Matchbox Educable Noughts and Crosses Engine) "computer" built out of matchboxes and beads (Michie, 1986).
3.2 Hierarchical Methods
Hierarchical methods rely on decomposing a problem into smaller parts. The so-
lution involves multiple levels or stages of decision making that together solve the
whole problem. To apply a hierarchical approach someone has to decompose the
problem. This task is usually left to the designer of the algorithm, however, many
researchers point to the desirability of learning decompositions automatically.
The rest of this chapter will focus on hierarchical methods and explore three
themes:
Multi-level Learning. There is much support for the idea that learning in multi-
level stages has the capability to overcome the “curse of dimensionality”.
Learning common sub-problems may be leveraged in subsequent learning. If
the solutions to the sub-problems are used to inductively bias the learning at
the next level, the search space can be reduced. Iteration over multiple levels
creates a scaffolding effect with potentially multiplicative effects for scaling
learning algorithms in complex environments.
Hierarchical Reinforcement Learning. Hierarchical reinforcement learning is
commonly structured using a gating mechanism that learns to switch in one of
a number of more detailed policies as abstract actions. Gating mechanisms can
be cascaded to produce multiple levels of control. Recently proposed frame-
works have much in common and all use a semi-MDP formalism to model
abstract actions.
Automatic Decomposition. This section reviews various approaches to auto-
mate the decomposition of problems. The selection of reviews has not been
limited to reinforcement learning. The general thrust is to automate the
discovery of components of hierarchical learners such as identifying Markov
reusable sub-regions and identifying useful sub-goals.
3.3 Learning Hierarchical Models in Stages
Many researchers have come to the conclusion that learning in stages is necessary
to learn effectively in complex environments.
Simon (1996) describes hierarchical systems as ones composed of interrelated
sub-systems, each in turn being hierarchical in structure until the lowest level of
elementary sub-systems is reached. What is elementary is relative and the subject
is broached again in Chapter 8 with variable resolution models. The lowest level of
resolution available to agents is their sensor-effector interface with their environment.
Why hierarchy? Simon (1996) notes that empirically, a large proportion of com-
plex systems seen in nature, exhibit hierarchical structure. On theoretical grounds,
hierarchy allows the complex to evolve from the simple. From a dynamical view-
point, hierarchical systems have the property of near decomposability, simplifying
their behaviour and description.
In his analysis of the evolution of complex systems, Simon comes to the conclu-
sion that, “Complex systems will evolve from simple systems much more rapidly if
there are stable intermediate forms”.
Ashby (1952, 1956) talks about amplifying the regulation of large systems in
stages. For example, a manager looks after mechanics who look after air-conditioners.
Ashby calls these hierarchical control systems “ultra-stable”. The mechanics are
hired and fired based on performance and they in turn replace parts or the whole
machine in case of malfunction or other changes. One feature of this type of dy-
namic hierarchy is that systems at higher levels tend to make decisions on longer
time scales. The advantage is that the regulatory load is reduced for the supervising
system and the sub-system is given time to adapt to any changes. Ashby’s conclu-
sion, “... the provision of a small regulator at the first stage may lead to the final
establishment of a much bigger regulator so that the process shows amplification.”
Among the basic properties of complex adaptive systems, Holland (1995) lists (1)
aggregation and (2) building blocks. Aggregation refers to forming categories or se-
lecting salient features and in a second sense aggregating behaviours of sub-systems.
Building blocks capture repetition in models. They serve to impose regularity in a
complex world. Holland’s conclusion, “We gain significant advantage when we can
reduce the building blocks at one level to interactions and combinations of building
blocks at a lower level.”
In The Society of Mind, Minsky (1985) phrases it this way, “Achieving a goal
by exploiting the abilities of other agencies [...] is the very power of societies. No
higher-level agency could ever achieve a complex goal if it had to be concerned with
every small detail [...]”
In his famous paper on six different representations for the missionaries and
cannibals problem, Amarel (1968) showed foresight into the possibility of making
machine learning easier by discovering regularities and subsequently using them for
formulating new representations.
Clark and Thornton (1997) present a persuasive case for modelling complex
environments (type-2 problems in their language), namely, that it is necessary to
proceed bottom up and solve type-1 problems (tractable as originally coded) as
intermediate representations. In their words, “[...] the underlying trick is always
the same; to maximise the role of achieved representations, and thus minimise the
space of subsequent search”.
Stone (1998) advocates a layered learning paradigm to complex multi-agent sys-
tems in which learning a mapping from an agent’s sensors to effectors is intractable.
The principles advocated include problem decomposition into multi-layers of ab-
straction, learning tasks from the lowest level to the highest in a hierarchy where
the output of learning from one layer feeds into the next layer.
More recently Utgoff and Stracuzzi (2002) point to the compression inherent in
the progression of learning from simple to more complex tasks. They suggest a
building block approach, designed to eliminate replication of knowledge structures.
Agents are seen to advance their knowledge by moving their “frontier of receptivity”
as they acquire new concepts by building on earlier ones from the bottom up. Their
conclusion, “Learning of complex structures can be guided successfully by assum-
ing that local learning methods are limited to simple tasks, and that the resulting
building blocks are available for subsequent learning.”
Other examples that support this view include (1) Sammut’s (1981) Marvin, a
program that first learns simple concepts that are then used to learn the descrip-
tion of more complex concepts; (2) constructive induction - any form of induction
that generates new descriptors not present in the input data (Dietterich and Michal-
ski, 1984) and (3) structured induction (Shapiro, 1987) in which the user guides a
machine to learn sub-concepts that are used to find the solution to more complex
problems.
There is an important idea common to all these examples. It appears that one
way to scale learning algorithms in complex environments is to learn in stages,
starting with simple concepts for repetitive situations. These simple concepts are
used as inductive bias at higher levels, building a multi-layered abstraction hierarchy.
This principle underlies the work on hierarchical reinforcement learning in this thesis.
3.4 Hierarchical Reinforcement Learning
This section reviews hierarchical approaches for manually decomposed problems
to provide the orientation for the later discussion on more automatic methods of
decomposition.
Hierarchical reinforcement learners can be viewed as gating mechanisms that, at
a higher level, learn to switch in appropriate and more reactive behaviours at lower
levels.
Ashby (1952) proposed just such a gating mechanism (see figure 3.1) for an agent
to handle recurrent situations. Even at this time it was envisaged that the switched-
in behaviours can be learnt adaptions to repetitive environmental “disturbances”.
Figure 3.1: Ashby's 1952 depiction of a gating mechanism to accumulate adaptions for recurrent situations.
The behaviours are accumulated and switched in by “essential” variables “working
intermittently at a much slower order of speed”. The gating mechanism and the
lower level behaviours can of course be hard-coded by a designer and do not have
to be learnt. Robots using a subsumption architecture (Brooks, 1990) provide an
example. Ashby (1956) also recognised that there is no reason that the gating
mechanism should stop at two levels and that the principle could be extended to
any number of levels of control.
Watkins (1989) is often referenced for his contribution to reinforcement learning
of the Q action-value function which allows incremental model free learning (without
having to explicitly store the transition or reward functions). In his thesis he also
discusses the possibility of hierarchical control consisting of coupled Markov decision
problems at each level. In his example, the top level, like the navigator of an 18th
century ship, provides a kind of gating mechanism, instructing the helmsman on
which direction to sail. Watkins did not implement a hierarchical control system in
his thesis, but did raise some pertinent observations and issues about hierarchical
reinforcement learning: (1) at higher levels, decisions are made at slower rates, (2)
there is a distinction between global optimality and optimality at each level, (3)
the possibility of learning concurrently at different levels and (4) different reward
structures may be necessary at each level.
Singh (1992) developed a gating mechanism called Hierarchical-DYNA (H-DYNA),
an extension to DYNA (Sutton and Barto, 1998). DYNA is a reinforcement learner
that uses both real and simulated experience after building a model of the reward
and state transition function. H-DYNA first learns elementary tasks such as to
navigate to specific goal locations. Each task is treated as an abstract action at a
higher level of control and is able to be switched in by a gating mechanism. The
gating mechanism itself is a reinforcement learner. For example, the elementary
tasks may be to navigate to two separate positions A and B. Once each elementary
task has been learnt, in order to learn the composite task (go to A first then B),
H-DYNA only needs to learn to switch in the abstract actions navigate-to-A fol-
lowed by navigate-to-B. Treating sub-tasks as abstract actions is a common theme
in hierarchical reinforcement learning.
In their “feudal” reinforcement learning algorithm, Dayan and Hinton (1992)
emphasise another desirable property of hierarchical reinforcement learners - state
abstraction. They call it “information hiding”. It is the idea that decision mod-
els should be constructed at coarser granularities further up the control hierarchy.
Their learner is essentially a multi-level gating mechanism, described as a strict
management hierarchy, where managers have sub-managers that work for them and
bosses that they work for in turn. Reward is administered by each level manager
to the subordinate agent depending on whether the instructions were carried out
successfully, rather than whether the manager succeeded or failed. Feudal learning
initially takes longer to improve its performance and then learns more rapidly than
ordinary learners, as higher level managers take advantage of the acquired skills of
their subordinates in new situations.
In the hierarchical distance to goal (HDG) algorithm, Kaelbling (1993) intro-
duced the important idea of composing the value function from distance components
along the path to a goal. HDG is modelled on navigation by landmarks. The idea is
to learn and store local distances to neighbouring landmarks and distances between
any two locations within each landmark region. Another function is used to store
shortest-distance information between landmarks as it becomes available from local
distance functions. The HDG algorithm aims for the next nearest landmark on the
way to the goal and uses the local distance function to guide its primitive actions. A
higher level controller switches lower level policies to target the next neighbouring
landmark whenever the agent enters the last targeted landmark region. The agent
therefore rarely travels through landmarks but uses them as points to aim for on its
way to the goal. Interruption of the sub-tasks, before their completion, in order to
improve the policy, is an important idea that is later developed by Dietterich (2000)
as hierarchical greedy execution. When the goal state is in its current landmark
region, the agent goes directly to the goal position. Savings are achieved by com-
posing a higher level distance function from local distance functions, to navigate
between landmarks. Moore et al. (1999) have extended the HDG approach with the
“airport-hierarchy” algorithm. This extension is discussed further below.
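To make the composition concrete, the following small Python sketch (purely illustrative; the containers region_of, local_dist and landmark_dist are assumptions, not Kaelbling’s data structures) assembles a distance-to-goal estimate from a local leg, a landmark-level leg and a final local leg within the goal’s region.

    def composed_distance(state, goal, region_of, local_dist, landmark_dist):
        """Estimate the distance from state to goal by composing local and
        landmark-level distance functions.

        region_of[s]            : the landmark whose region contains state s
        local_dist[L][(a, b)]   : learnt distance between locations a and b
                                  within landmark L's region
        landmark_dist[(L1, L2)] : learnt shortest distance between two landmarks
        """
        ls, lg = region_of[state], region_of[goal]
        if ls == lg:
            # The goal lies in the current landmark region: go to it directly.
            return local_dist[lg][(state, goal)]
        # Otherwise compose three legs: state to its landmark, landmark to
        # landmark, and the goal's landmark to the goal position.
        return (local_dist[ls][(state, ls)]
                + landmark_dist[(ls, lg)]
                + local_dist[lg][(lg, goal)])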
Some key features from the above references may be summarised as follows:
• the abstraction of learnt sequences of more primitive actions as a single be-
haviour or sub-skill.
• the use of gating mechanisms at higher levels that learn to switch in sub-skills
to achieve their own ends.
• state abstraction to simplify the model of the environment thereby reducing
the resolution of the state description to coarser levels of granularity.
• decomposition of the value function of the total task into the sum of the
separate value functions of the higher and lower level tasks.
These four elements are inter-related. Higher levels of the control hierarchy
are associated with higher levels of both temporal and state abstraction. From
the gating mechanism’s perspective the lower level behaviours may persist over a
number of time steps before control is returned. When the gating mechanism is itself
implemented as a reinforcement learner it is natural to use semi-Markov decision
theory to model the extended nature of its decisions (abstract actions).
The next three examples will look at more recent developments in hierarchical
reinforcement learning building on the commonalities so far.
3.4.1 Options
Sutton et al. (1999a) use the term option to model abstract actions. An option
is described by three components: the set of states from which the option can be
invoked, the policy followed by the option while it is executing and the probability of
the option terminating in any state. A higher level controller can decide to initiate an
option in any of its invoking states. Once the option is started it follows its policy and
stochastically terminates, whereupon control is returned to the controller, allowing
it to select another option. Options are abstract actions in a semi-Markov decision
process. Sutton et al. (1999a) and McGovern (2002) develop the option formalism,
which extends reinforcement learning from primitive actions to options in a natural
way, enabling optimal option policies to be learnt.
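As a concrete illustration, an option and the semi-MDP Q-learning backup applied when it terminates might be sketched as follows in Python (the class layout and update form are assumptions for exposition, not the cited authors’ code; with the discount set to one the backup reduces to the undiscounted case used elsewhere in this thesis).

    from dataclasses import dataclass
    from typing import Callable, Set

    @dataclass(eq=False)            # identity hashing so options can key a Q table
    class Option:
        initiation_set: Set         # states from which the option may be invoked
        policy: Callable            # maps a state to a primitive action while executing
        termination_prob: Callable  # maps a state to the probability of terminating there

    def smdp_q_update(Q, s, o, reward_sum, s_next, k, options, alpha=0.1, gamma=1.0):
        """One semi-MDP backup after option o, started in state s, ran for k steps,
        accrued reward_sum and terminated in s_next."""
        candidates = [Q.get((s_next, o2), 0.0)
                      for o2 in options if s_next in o2.initiation_set]
        best_next = max(candidates) if candidates else 0.0
        old = Q.get((s, o), 0.0)
        Q[(s, o)] = old + alpha * (reward_sum + gamma ** k * best_next - old)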
Figure 3.2: The maze from Chapter 1 reproduced here for convenience.
The authors show, for example, how options can learn faster proceeding on a
room by room basis, rather than position by position, in a similar rooms environment
to that shown in figure 3.2. When the goal is not in a location that can conveniently
be reached by the given options, it is possible to include primitive actions as
special-case options and still accelerate learning to some extent. For example, with
room leaving options alone, it is not possible to reach a goal in the middle of the
room. Primitive actions are required when the room containing the goal state is
entered. Although the inclusion of primitive actions guarantees convergence to the
globally optimal policy, this may create extra work for the learner. It is not clear
that options alone provide significant advantages (McGovern and Sutton, 1998)
over reinforcement learning acceleration techniques such as eligibility traces (Sutton
et al., 1999a) and prioritised sweeping (Moore and Atkeson, 1993). The answer
may well lie in tailoring options to meet specific sub-goals and giving them greater
priority. Unless options include primitive actions, a globally optimal policy cannot
be assured. With primitive actions the load on higher levels increases and it is
difficult to achieve scaling.
3.4.2 HAMs
Parr (1998) also uses the semi-MDP framework to model abstract actions. He
reformulates an MDP as a Hierarchy of Abstract Machines (HAMs). In the HAM
approach, policies in the overall MDP are constrained by defining a stochastic finite
state controller as a program that produces actions as a function of the agent’s
sensor state. The underlying abstract machines have (1) action states that specify
an action to be taken in the current environment state, (2) call states that can
execute another machine as a subroutine, (3) stop states that return control to the
calling machine and (4) choice states where action choices are learnt to optimise the
value function.
For example, to tackle the room navigation problem in figure 3.2, a machine
could be specified to move the agent in one of four diagonal directions. If each room
corner is defined as a choice state to switch direction, this machine could learn to
reach the goal. In this example the base level states can be reduced from 75 to 12 by
removing the uncontrolled or non-choice states from the original problem, forming a
semi-MDP. Clearly the quality of the solution of the original MDP depends heavily
on the specification of the underlying machine for the HAM. Global optimality
guarantees are not possible.
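The machine structure itself can be pictured with a small, purely illustrative Python sketch (the representation is an assumption for exposition, not Parr’s implementation): each machine state is tagged as an action, call, choice or stop state, and learning applies only at the choice states.

    from enum import Enum, auto

    class MachineState(Enum):
        ACTION = auto()   # emits a primitive action in the current environment state
        CALL = auto()     # invokes another machine as a subroutine
        CHOICE = auto()   # the learner chooses among successor machine states
        STOP = auto()     # returns control to the calling machine

    # A toy fragment of a diagonal-movement machine: move North then East repeatedly
    # until a choice state is reached, where the learnt policy either continues or
    # hands control back to the calling machine.
    move_north_east = [
        {"kind": MachineState.ACTION, "action": "North", "next": 1},
        {"kind": MachineState.ACTION, "action": "East",  "next": 2},
        {"kind": MachineState.CHOICE, "successors": [0, 3]},
        {"kind": MachineState.STOP},
    ]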
On state aggregation Parr (1998) says, “One of the unsatisfying aspects of the
more formally justifiable state aggregation methods is that they fail to capture much
of the intuitive notion of an abstract state”. People reason about objects like rooms
as if they were single states, yet HAMs are unable to achieve this. This issue is
addressed with algorithms like MAXQ (and HEXQ) where, for example, the three
rooms from figure 3.2 are treated as single abstract states.
Andre and Russell (2002) have recently built on this approach with ALisp, a
programming language that effectively extends the HAM approach. They have also
introduced function decomposition and state abstraction along the lines of MAXQ by
extending the decomposition with a three part equation that ensures hierarchically
optimal solutions for reusable sub-tasks.
3.4.3 MAXQ
Dietterich (2000) formalises an approach to hierarchical reinforcement learning,
called MAXQ, which incorporates all the aforementioned elements. With MAXQ
an episodic MDP is manually decomposed into a hierarchical directed acyclic graph
of sub-tasks. This structure of sub-tasks is called a MAXQ graph and has one top
(root) sub-task. Each sub-task is a smaller (semi-)MDP. In decomposing the MDP
the designer specifies the active states and terminal states for each sub-task. Ter-
minal states are typically classed either as goal terminal states or non-goal terminal
states. Using pseudo-reward disincentives for non-goal terminal states, policies are
learnt for each sub-task to encourage them to terminate in goal terminal states. The
actions defined for each sub-task can be primitive actions or other (child) sub-tasks.
Each sub-task can invoke any of its child sub-tasks as abstract actions. When a
task enters a terminal state it, and all its children, abort and return control to the
calling sub-task.
MAXQ has a number of notable features. MAXQ represents the value of each
state in a sub-task as a decomposed sum of completion values. A completion value is
the expected discounted cumulative reward to complete the sub-task after taking the
next (abstract) action. For the maze in figure 3.2, assume that a designer decomposes
the problem with the root sub-task terminating when the goal is reached and with
four child sub-tasks defined with terminations at each room exit to the North, South,
East and West. To calculate the value of the very top left location, for example, the
minimum completion value after the abstract action to leave the room to the East
or South terminates, is 4. The minimum completion value for each lower sub-task
after the next primitive action towards the room exit terminates is 5. Note that
the room leaving abstract actions are the child tasks invoked by the root task. The
completion value for a primitive action is the expected next primitive reward value.
Therefore the total value of the very top left location is also 11, but composed as
1 + 5 + 5.
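The recursion behind the decomposed value can be sketched in a few lines of Python (an illustration under assumed data structures, not Dietterich’s implementation): the value of a state under a sub-task is the value under its best child plus the learnt completion value for that child, bottoming out at primitive actions.

    def maxq_value(task, s, children, C, expected_reward):
        """Decomposed MAXQ value: V(task, s) = max over a of [ V(a, s) + C(task, s, a) ].

        children[task]          : child sub-tasks or primitive actions of task
        C[(task, s, a)]         : learnt completion value of task after doing a in s
        expected_reward[(s, a)] : expected one-step reward of primitive action a in s
        """
        if task not in children:                 # primitive action: no children
            return expected_reward[(s, task)]
        return max(maxq_value(a, s, children, C, expected_reward) + C[(task, s, a)]
                   for a in children[task])

Applied to the example above, the recursion sums the one-step primitive reward and the two completion values, one per level of the hierarchy.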
Another feature of MAXQ is that it allows sub-task policies to be reused in
different contexts. The price to pay for this reuse is that the internal policies are
not attuned to each external situation and may be sub-optimal when viewed from
outside. Dietterich (2000) illustrates this well with a room that has two separated
goal terminal states representing doorways. The optimal policy for this sub-task
causes the agent to exit via the nearest doorway. If in a particular context one
doorway presents a more desirable exit, the MAXQ agent has no way of knowing
this and will not favour the more desirable exit. In this context the room leaving
policy is sub-optimal. The MAXQ solution is predicated on isolating each sub-task
policy in this way and parent sub-tasks use the isolated optimal policies of their child
sub-tasks. Dietterich calls this a recursively optimal solution to the overall MDP
in contrast to a hierarchically optimal solution in which each sub-task is optimised
given its context.
The final MAXQ feature to be highlighted is the opportunity for state abstrac-
tion. State abstraction is key to reducing the storage requirements and improving
the learning efficiency. State abstraction means that multiple states are aggregated
and one completion value stored for the aggregate state in a sub-task. Dietterich
(2000) identifies five conditions for safe state abstraction. State abstractions are
“safe” when the decomposed value functions for all base level states are identical
before and after abstraction. Without state abstraction the learning efficiency of a
hierarchical decomposition can be worse than that of a flat2 learner (see the MAXQ
performance (Dietterich, 2000) on Kaelbling’s 10× 10 maze).
As will be seen in subsequent chapters, safe state abstraction is an integral feature
of HEXQ as it uncovers hierarchical structure in MDPs. The limitation on safe state
abstraction in HEXQ for discounted value functions will be removed in Chapter 7
with the introduction of a supporting on-policy discount function.
It should be noted that MAXQ does not include the reward on completing the
(abstract) action as a part of its completion value. Instead MAXQ includes the
reward on exit as a part of the internal value of a sub-task. This thesis will argue
that rewards on sub-task exit are more naturally explained outside a sub-task and
should not be included in the internal value of a sub-task.
Determining the value of a state or finding the next best (greedy) action to
execute requires a depth first search through the completion values of the sub-tasks
in the MAXQ graph. Large branching factors and in particular deep hierarchies can
be computationally expensive. This issue will be addressed for HEXQ in Chapter
8. The solutions have relevance for MAXQ.
2 To distinguish a normal MDP from a hierarchically decomposed structure, the former is referred to as flat.
3.4.4 Optimality of Hierarchical Structures
The approaches to hierarchical reinforcement learning cited above cannot provide
any guarantees on how close the hierarchical solution is to the optimal solution of
the original MDP when the designer imposes constraints to simplify the problem.
In each of Options, HAMs and MAXQ, the sub-task policies are usually constrained
artificially by the programmer. Options and HAMs have been shown to be hierar-
chically optimal. This means the solution is optimal given the constraints of the
hierarchy. However, hierarchical optimality can be arbitrarily bad. If a designer
chooses a poor underlying machine for the HAM, the overall solution may be hier-
archically optimal but globally poor. Recursive optimality can be arbitrarily worse
than a hierarchically optimal solution as in this case further sub-optimality is intro-
duced by ignoring the context of the sub-task.
In each of the above approaches the hierarchically decomposed problem is ex-
ecuted by invoking abstract actions, possibly recursively, and running each action
until termination of the sub-task. Dietterich calls this hierarchical execution. Ter-
mination is defined stochastically for options, by choice states in HAMs and by
termination predicates in MAXQ. The stochastic nature of MDPs means that the
condition under which an abstract action is appropriate may have changed after the
action’s invocation and that another abstract action may be a better choice. How-
ever, the sub-task policy is locked-in until termination. This sub-optimal behaviour
can be arrested by interrupting the sub-task, as for example in the HDG algorithm
(Kaelbling, 1993). Dietterich calls the procedure hierarchical greedy execution when
each sub-task is interrupted after each primitive action step. The next best action
is recomputed from the top of the task hierarchy. While this is guaranteed to be
no worse than the hierarchically optimal or recursively optimal solution and may be
considerably better, it still does not provide any global optimality guarantees. The
policy constraints imposed by the sub-task and the hierarchical structure may be
such that a globally optimal policy cannot be executed.
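The difference between hierarchical execution and hierarchical greedy execution can be made explicit with a hedged Python sketch (greedy_action, is_primitive and is_terminal are assumed helpers, not a particular implementation): the hierarchy is re-entered from the root after every primitive step, so a sub-task is abandoned as soon as another abstract action looks better.

    def greedy_primitive(root_task, s, greedy_action, is_primitive):
        """Descend from the root, taking the greedy (abstract) action at each level,
        until a primitive action is reached; only that single step is executed."""
        task = root_task
        while not is_primitive(task):
            task = greedy_action(task, s)   # depth-first descent through the hierarchy
        return task

    def run_hierarchically_greedy(env, root_task, greedy_action, is_primitive, is_terminal):
        s = env.reset()
        while not is_terminal(s):
            a = greedy_primitive(root_task, s, greedy_action, is_primitive)
            s, _reward = env.step(a)        # re-plan from the root after each step
        return s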
Interestingly, Dean and Lin (1995) and Hauskrecht et al. (1998) do provide de-
composition and solution techniques that make optimality guarantees, but unfortu-
nately, unless the MDP can be decomposed into very weakly coupled smaller MDPs
the computational complexity is not necessarily reduced. Dean and Lin recom-
pute the region sub-MDPs and the higher level semi-MDP iteratively. Hauskrecht,
Meuleau, Kaelbling, Dean, and Boutilier suggest a set of policies over each sub-
space that cover all combinations of possible external values to within a defined
error (ε-grid approach). They point out that the number of policies required can
be extremely large, even when the external state values are bounded. In turn Parr
(1998) proposes techniques to reduce the policy cache and still retain optimality
guarantees. These techniques require some overhead and can still produce large
policy caches. The other major drawback is that they do not facilitate state ab-
straction. It is not possible in general to abstract whole sub-space regions as the
transitions at the abstract level would not be Markov, but depend on the history,
in particular, on the way the region was entered.
The selection of the policy cache for sub-spaces is a key issue that can trade
off computational complexity against optimality. Re-solving the sub-space value
function iteratively and the ε-grid approach are solutions to MDP decomposition
with optimality guarantees. Constraining termination to handcrafted termination
predicates, as in MAXQ, can minimise the size of the cache, but does not provide any
guarantees. Hauskrecht et al. (1998) suggested and dismissed a heuristic to generate
one policy for each peripheral state of a region, reducing the cache size to d, the
number of peripheral states. The problem is that, once a policy is invoked it will
stubbornly try to reach a peripheral state even though it may have drifted closer
to another, more preferable peripheral state. This issue can largely be overcome
in practice using hierarchical greedy execution (Dietterich, 2000), as the next best
primitive step to take is re-evaluated at each step and takes into consideration
stochastic drift.
As mentioned previously, Sutton et al. (1999a) allow primitive actions as op-
tions to guarantee optimality, but again there is no evidence that computational
complexity can be reduced in general.
A variation on the one policy per peripheral state approach is used by HEXQ to
generate a policy cache for each sub-space. While not generally making optimality
guarantees, this approach provides the right conditions for the safe abstraction of
sub-spaces. HEXQ is hierarchically optimal in general and globally optimal for
deterministic transitions when using undiscounted value functions. For stochastic
transitions, HEXQ is globally optimal if the stochasticity is limited to the top level
sub-MDP.
3.5 Learning Hierarchies: The Open Question
In the above approaches the programmer is expected to decompose the overall prob-
lem into a hierarchy of sub-tasks. The programmer must craft appropriate sub-goals
and sub-task termination conditions, decide on safe state abstraction, allocate re-
ward hierarchically or program stochastic finite state controllers with the right choice
points. Any progress towards the aim of training rather than programming agents to
achieve goals in complex environments will require that agents themselves must find
ways of learning and revising their own task hierarchies based on their experience.
The motivation for discovering hierarchy in reinforcement learning is well stated by
Dietterich (2000): “It is our hope that subsequent research will be able to automate
most of the work that we are currently requiring the programmer to do”.
To tackle the task of automating the construction of hierarchies it is necessary
to find ways of identifying the sub-system components, finding the right level of
abstraction and interfacing the sub-tasks at each level. The next
subsections will review some approaches to this challenge.
3.5.1 Bottleneck and Landmark States
One way to automate the decomposition of a multi-task problem is to find states
that are visited frequently to solve each of the tasks. These states can then become
landmarks or sub-goals that an agent can use to solve larger problems. For example
in the maze of figure 3.2, if the agent is started in a random location in the top left
room it necessarily needs to exit via one of the two doorways on the way to the goal.
The states adjacent to each doorway will be visited more frequently in successful
trials than other states. Both Digney (1998) with the NQL (nested Q learning)
algorithm and McGovern (2002) use this idea to identify sub-goals. Menache et al.
(2002) discover bottleneck states by finding state space cuts where the flow properties
of the state transition graph discovered by the agent are minimal. The agent first
learns a sub-policy to reach the sub-goal. The sub-policy is reused to accelerate the
learning of other goals that have the bottleneck sub-goals as intermediate points.
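The frequency heuristic is simple enough to sketch in Python (an illustration of the idea only, not the exact statistic used by Digney or McGovern): states that recur across many successful trajectories are proposed as bottleneck sub-goals.

    from collections import Counter

    def bottleneck_candidates(successful_trajectories, top_k=2):
        """successful_trajectories: list of state sequences that reached the goal.
        Returns the top_k most frequently visited states as candidate sub-goals."""
        counts = Counter()
        for trajectory in successful_trajectories:
            # Count each state at most once per trajectory so that long, wandering
            # episodes do not dominate the statistic.
            counts.update(set(trajectory))
        # A fuller treatment would exclude start and goal states and compare the
        # counts against those from unsuccessful trajectories.
        return [s for s, _ in counts.most_common(top_k)]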
Digney (1998) suggests using high reinforcement gradients as distinctive areas to
indicate useful sub-goals. Singh (1992) uses landmark states provided for some tasks
as sub-goals to solve composite tasks. He proposes the identification of interesting
landmark states that would make useful sub-goals. Interestingly, Kaelbling (1993)
and later Moore et al. (1999) suggest that, for navigation tasks, performance is
insensitive to the position of landmarks and an (automatic) randomly-generated set
of landmarks does not show widely varying results from more purposefully positioned
ones.
3.5.2 Common Sub-spaces
Rather than looking for intermediate points, another approach is to look for common
behaviour trajectories or common region policies. Thrun and Schwartz (1995) use
the SKILLS algorithm to find policies by growing or contracting seeded sub-space
regions. The idea is to use the partial policies over each sub-space generated from a
variety of tasks to select ones that maximise performance and minimise description
length over a number of tasks. The selected skills constrain the policies available
when learning new tasks. McGovern (2002), in a second method to discover ab-
stract actions, searches for common sub-sequences of actions in successful trials of a
learning agent. She also suggests looking for observation sequences and notes that
these may be generalised to unseen situations if sensory readings remain consistent
across tasks. This last point is pertinent to HEXQ which generalises on this basis.
Drummond (2002) detects and reuses metric sub-spaces in reinforcement learning
problems. He delimits each sub-space using the value function gradient between
neighbouring states. A high gradient means that there is an impediment such as
a wall. The doorways into and out of sub-spaces are also detected by the value
function gradient in that an exit is a local maximum and an entry point is a local
minimum. Sub-space value functions are stored and indexed by the nodes of their
fitted polygon. They are compared to new situations using sub-graph matching
and adapted using transformations on their value function. The end result is that
an agent can learn in a similar situation much faster after piecing together value
function fragments from previous experience.
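A minimal Python sketch of the boundary test (illustrative only; Drummond’s algorithm additionally fits polygons to the detected regions and matches sub-graphs) flags adjacent states whose learnt values differ by more than a threshold as candidate sub-space boundaries.

    def high_gradient_edges(V, neighbours, threshold):
        """V[s] -> learnt state value, neighbours[s] -> states adjacent to s.
        Returns pairs of adjacent states whose value gradient exceeds threshold,
        i.e. candidate sub-space boundaries."""
        return {(s, s2)
                for s in V
                for s2 in neighbours.get(s, [])
                if s2 in V and abs(V[s] - V[s2]) > threshold}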
3.5.3 Multi-dimensional States
When the problem state is perceived as a vector of features, sub-systems based
on a subset of these features are good candidates for decomposition. This idea is
exploited by Knoblock (1994) in ALPINE (a hierarchical extension to PRODIGY)
which implements an automated approach to generating abstractions in planning.
In planning, the problem is to find a path from an initial state to a goal state.
The state transition function is defined by operators with preconditions and effects.
A state must meet the preconditions before the operator can be applied and the
effects, usually in the form of add and delete lists, describe the changes to the state
when the operator is applied. A solution to the problem is a sequence of operators
(a policy) that transitions from the initial state to the goal state. Planning is
thus closely related to deterministic shortest path problems, a sub-class of MDPs.
By dropping terms, ALPINE looks for an abstract representation of the original
problem. It attempts an automatic hierarchical decomposition by searching for
partitions such that the solution at the abstract level is not affected by the solution at
the detailed level. For example, in the Towers of Hanoi problem, once the largest disc
is placed it need not be disturbed by the placement of smaller discs. While ALPINE
is a hierarchical planner there are similarities with the hierarchical reinforcement
learner HEXQ. The ordered monotonicity property, that ensures literals established
at abstract levels are not changed by refined plans in ALPINE, is related to the
nested Markov ordering requirement in CQ (Hengst, 2000), a precursor algorithm to
HEXQ. The automatic decomposition of the Towers of Hanoi problem with HEXQ
will be covered in Chapter 6.
3.5.4 Markov Spaces
Dayan and Hinton (1992) provide a multi-level partition of the state space with an
increase in resolution between the levels. They show how learning can be improved
by reusing local controllers at the different levels.
Moore (1994) uses the Parti-game algorithm to automate the partitioning of the
state space based on whether the agent becomes “stuck” and fails to move to a
neighbouring cell. By splitting local cells in the stuck region, the state space is
redefined to provide a better Markov representation that can learn to reach the goal
without failure. While this approach is of interest in automatically finding Markov
regions by decomposing the state space, the variable resolution Parti-game algorithm
is not a hierarchical controller in the sense characterised in this thesis. There are no
gating mechanisms at different levels. The algorithm does evolve through different
levels of resolution of controllers, but coarser versions are discarded in favour of finer
variable resolution models.
The UTree algorithm (McCallum, 1995) can be seen in a similar light. UTree
increases the resolution of the model by iteratively splitting both on the state space,
or the state and action history, to uncover hidden state. The splitting criterion
for UTree is that the states are Markov with respect to the value function (Utile
distinctions).
Uther (2002) uses a similar tree based method (TTree) to increase the resolution
of abstract states when abstract action “trajectories” initiated in these states give
varying results. TTree is hierarchical in the sense that the trajectories are either
default or user provided fixed policies that are switched in at the abstract level and
continue to execute until termination.
The common theme in these algorithms is that they each increase the resolu-
tion of sub-spaces in an effort to attain predictability (Markov property) in the
model. This will be seen to be an important criterion for HEXQ as sub-space region
boundaries are defined where the Markov property breaks down.
The idea of refining a learnt model based on unexpected behaviour is also devel-
oped by Ryan and Reid (2000). Here a hierarchical model, RL-TOPs, is specified
using a hybrid of planning and MDP models. Planning is used at the abstract level
and invokes reactive planning operators, extended in time, based on teleo-reactive
programs (Nilsson, 1994). These operators use reinforcement learning to achieve
their post-conditions as sub-goals. If the operators are specified too abstractly they
may have negative side-effects and this may ultimately cause failure of the planner to
reach its goal. This is amply demonstrated by Ryan (2002) using the taxi with fuel
task (Dietterich, 2000). If the agent is not told about fuel usage during navigation
it can run out of fuel and fail to deliver passengers. The RL-TOPs planner tries to
uncover hidden state using ILP based on records of positive and negative examples. Again,
as in Parti-game, UTree and TTree, the aim is to automatically refine the abstract
state to a description that is Markov with respect to primitive and abstract actions.
HEXQ results for the taxi with fuel problem will be discussed in Chapter 6.
HEXQ discovers the side effect of running out of fuel, constructs the appropriate
hierarchy of sub-MDPs and solves this problem optimally with hierarchical greedy
execution.
3.5.5 Other Approaches to Discovering Hierarchy
There is other related research that can be interpreted as discovering hierarchy. For
example, Harries et al. (1998) extract hidden context from training examples. “Con-
text” is another way of looking at the durative setting of the switching mechanism at
a higher level of abstraction. The learner tries to automatically uncover the settings
in time, by interactively sweeping the training set to minimise error rate for each
context period.
De Jong (2002) uses co-evolution and genetic algorithms to discover common
building blocks (intermediate level abstractions) that can be employed to represent
larger assemblies (higher level abstractions). The modularity, repetitive modularity
and hierarchical modularity bias of their learner is closely related to the state space
repetition and abstraction used by HEXQ.
Another interesting approach that automates decomposition, specifically for prob-
lems where each “location” is also a goal, is the multi-value-function airport hier-
archy algorithm by Moore, Baird, and Kaelbling (1999). This algorithm builds on
Kaelbling’s original HDG algorithm (Kaelbling, 1993). The airport algorithm relies
on a heuristic to generate successive (and denser) levels of landmark states (air-
ports) in such a way that each region around an airport overlaps higher level airport
locations. In this way the total state space is not partitioned in the mathematical
sense but covered by a decreasing but overlapping set of MDP regions. The regions
are exponentially reduced in size as they grow in number. Ultimately every location
becomes an airport. Policies to reach a region airport are cached at a primitive level
for each region. It is then possible, given any starting location to dynamically com-
pute the set of landmark states in order of decreasing level to reach any goal state.
The overlapping region construction ensures small fractions of regret (deterioration
from optimal performance) and the algorithm works directly over a wide range of
stochasticity.
This algorithm is specifically designed for point-to-point movement in problems
where there are few transitions between states.
3.6 Conclusions and Motivation for HEXQ
There is strong support for decomposition and a hierarchical approach to learning in
complex domains.
Automatically discovering decompositions and building task hierarchies are still
in the early stages of research. Each of the above approaches automates some of the
common elements of hierarchical learners discussed, namely, the multi-level gating
mechanisms, action abstraction, state abstraction and value-function decomposition.
Finding good sub-tasks is clearly important. Discovered sub-tasks should be reusable
multiple times and sub-task policies should be implementable as abstract actions at
higher levels of abstraction.
Despite reviewing various ways of reducing MDP representations, for example
by using the factored representation of Bayes nets, Boutilier et al. (1999) still see
a research gap. They identify the problem as follows: “Unfortunately, we are not
aware of any particular useful heuristics for finding serial decompositions for Markov
decision processes. Developing such heuristics is clearly an area for investigation.”,
and “... the problems of discovering good decompositions, constructing good sets
of macros, and exploiting intensional representations are areas in which clearer,
compelling solutions are required”.
This thesis describes a hierarchical reinforcement learning algorithm, HEXQ,
that automates each of the common elements discussed above. HEXQ takes a flat
reinforcement learning problem, without a prior model, and attempts an automatic
hierarchical decomposition and solution. HEXQ may be imagined as a learning agent
that autonomously constructs an interconnected hierarchy of models based purely on
its sense-act interaction with its environment. The sensor state is assumed to make
the environment accessible, in that, state transitions and rewards are Markov with
respect to the current sensor state and next action. The sensor state is described
by a vector of variables representing environmental features. It is the hierarchical
relationship between these features that HEXQ tries to discover and exploit.
HEXQ performs a simple variable-wise decomposition. HEXQ automatically
finds regions of sub-space that it can reuse multiple times. It learns a restricted
set of policies over these regions to provide flexible abstract actions for higher level
models. Higher level models are based on state abstractions that together with the
abstract actions form well defined Markov models. HEXQ uses a decomposed value
function that exactly and compactly represents the “flat” value function for any
hierarchical policy.
HEXQ was inspired by (1) the observation that the scientific method endeavours
to discover Markov sub-problems and could possibly be automated, and (2) the idea
that automatic discovery of structure in sequences (Nevill-Manning, 1996) may be
generalisable to state-action spaces. The various approaches to room grid-world de-
compositions (Singh, 1992, Dayan and Hinton, 1992, Dean and Lin, 1995, Digney,
1996, Hauskrecht et al., 1998, Parr, 1998, Boutilier et al., 1999, Precup, 2000, Sut-
ton et al., 1999b) provided the challenge for early decomposition attempts. Value
function decomposition was directly inspired by MAXQ (Dietterich, 2000).
The next chapters will start with the decomposition of stochastic shortest path
MDPs.
Chapter 4
HEXQ Decomposition
This chapter describes the decomposition of multi-dimensional stochastic shortest
path MDPs. Specific decomposition conditions are stated that may partition the
state space of a multi-dimensional MDP into a hierarchy of smaller sub-spaces.
The aim in this Chapter is to describe the decomposition theoretically. Chapter
5 will describe how the decomposition and solution processes are automated in
practice by the HEXQ algorithm.
HEXQ is the name used throughout this thesis to refer to the decomposition
conditions, the resulting hierarchy of sub-MDPs and the algorithm that automates
the decomposition procedure.
The theory is developed in three sections:
• A multi-dimensional MDP is decomposed into a tree of smaller semi-MDPs
(sub-MDPs). Each policy of a child sub-MDP in the tree is invoked as an
abstract action from its parent sub-MDP.
• The HEXQ decomposition is shown to be hierarchically optimal. Under certain
conditions a globally optimal solution to the original MDP is assured.
• The HEXQ decomposition allows state abstraction. The value function, given
the hierarchy, can be compactly and losslessly represented, reducing storage
Figure 4.1: A simple example showing three rooms interconnected via doorways. Each room has 25 positions. The aim of the agent is to reach the goal.
requirements and implicitly transferring learning between sub-tasks.
The simple maze discussed in Chapter 1 and reproduced in figure 4.1 is used to
illustrate the basic concepts. This maze interconnects three rooms via doorways.
Two of the doorways lead to a goal. The agent can only sense the cell that it occupies
and move one step vertically or horizontally using actions North, South, East and
West. The thick solid lines indicate barriers through which the agent cannot move.
To keep the illustration simple, all transitions are deterministic with reward −1.
The agent is started at a random location.
4.1 Assumptions
It is assumed that an episodic MDP is provided with all rewards negative and the
discount rate set to one. These MDPs are called stochastic shortest path problems
because the objective is to minimize the distance to termination. The distances
between states are considered to be negative rewards or costs.
A proper policy is a policy that leads to the termination of an MDP, with prob-
ability one, from every initial state. A policy that is not proper is improper. The
optimum policy of a stochastic shortest path problem1 will come from the hypothesis
space of proper policies as all improper policies will have unbounded value functions
for some states. This follows from equation 2.4 in section 2.2 when all rewards are
negative and the discount rate is one.
All further references to MDPs will assume stochastic shortest path conditions
and policies will be assumed proper, until Chapter 7, where the HEXQ decomposi-
tion is extended to general finite MDPs.
Variable values in this thesis are not required to have any metric or ordering
property. They can have arbitrary attribute designations and need not be numeric.
Often variables will be described by the notation x and y, etc. This is not meant
to imply a co-ordinate system. For example, a valid description of one state in the
maze problem in figure 4.1 is (red-room, center-position)2.
4.2 HEXQ Hierarchies
This section describes the decomposition of the state space of an MDP into a par-
tition of regions in such a way that each region can be used to construct smaller
MDPs. Smaller MDPs defined over these regions are solved to generate a cache of
policies. The idea is to solve the overall problem more efficiently by using the cache
of regional policies as abstract actions in a semi-MDP in which the states are an
abstraction of the regions.
1 Bertsekas and Tsitsiklis (1991) have generalized stochastic shortest path problems to include non-negative rewards under the condition that the state value function for improper policies is −∞.
2 As a matter of convenience, all algorithmic implementations in this thesis have mapped the variable values to a subset of non-negative integers. Further implementations may use more flexible and efficient data structures.
4.2.1 Partitioning MDPs of Dimension Two
The starting point is to decompose a 2-dimensional MDP. Later, this will be gen-
eralised to any number of dimensions. For a 2-dimensional MDP, the overall state
is represented by two variables, say s = (x, y). The first step is to partition3 the
state space into regions in the context of one of the variables, say the y variable.
The aim is to construct regions with respect to the x variable that (1) have similar
Markov models and (2) allow control over inter-region transitions. These properties
are important for state abstraction, as will be seen later.
The concept of an exit is critical to a HEXQ4 decomposition and will be developed
and illustrated next. A transition from state s to state s′ on action a was introduced
in Chapter 2 and is written s →a s′.
For any partition of the state space, an inter-block transition is a transition
between any two distinct blocks of the partition, such that the probability of the
transition is greater than zero. In other words, s →a s′ is an inter-block transition
if the state space is partitioned into blocks, i.e. $G = \{g_1, \ldots, g_m\}$, $s \in g_i$, $s' \in g_j$,
$i \neq j$ and $T^a_{ss'} > 0$.
A terminating transition is a transition for which the original MDP terminates.
A variant transition (with respect to y) in the (x, y) state space is one with
non-Markov outcomes from the perspective of the x variable value alone. This
arises when either the y variable changes value or the state transition probability
or reward function differs between the same x variable values for different y
variable values. More formally, $s \rightarrow^a s'$ is a variant transition if, for all $s = (x, y)$,
$s' = (x', y')$:

1. $y \neq y'$, or

2. $T^a_{ss'} \neq T^a_{tt'}$ or $R^a_{ss'} \neq R^a_{tt'}$ or $y'' \neq y'''$, $\forall\; t \rightarrow^a t'$, $t = (x, y'')$, $t' = (x', y''')$
3 See Appendix A.
4 HEXQ was originally derived from the amalgam of Hierarchical, EXit and Q function.
Definition 4.1 An exit is the state-action pair (s, a) from any inter-block, termi-
nating or variant transition, s →a s′.
For any exit (s, a), s is referred to as the exit state and a as the exit action. An
agent is said to execute exit (s, a) if it takes exit action a from state s. The entry
states of a block are all those states in the block reached in one step following the
execution of an exit from any block, or, the states in the block that belong to the
set of starting states of the MDP. The set of exit states of a block g is designated
Exits(g) and the set of entry states Entries(g).
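To make the definitions concrete, the following hedged Python sketch shows how terminating and variant transitions could, in principle, be read off a table of experienced transitions for a two-variable state (x, y). The data layout is an assumption for illustration, transition probabilities are not estimated, inter-block exits only arise once a candidate partition is in hand and are omitted here, and the algorithm that performs this detection in practice is the subject of Chapter 5.

    from collections import defaultdict

    def find_exits(transitions):
        """transitions: iterable of (x, y, a, x2, y2, reward, terminated) tuples.
        Returns a set of exits, i.e. state-action pairs ((x, y), a)."""
        exits = set()
        # Outcomes projected onto the x variable, grouped by (x, a) and by context y.
        outcomes = defaultdict(lambda: defaultdict(set))
        for (x, y, a, x2, y2, reward, terminated) in transitions:
            if terminated or y2 != y:
                # Terminating transitions and transitions that change y are exits.
                exits.add(((x, y), a))
            outcomes[(x, a)][y].add((x2, reward))
        # Condition 2 of a variant transition: if the projected behaviour of (x, a)
        # differs between y contexts, ((x, y), a) becomes an exit in every context.
        for (x, a), by_context in outcomes.items():
            models = list(by_context.values())
            if any(model != models[0] for model in models[1:]):
                for y in by_context:
                    exits.add(((x, y), a))
        return exits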
Definition 4.2 A HEXQ Partition is a partition of the states of an MDP into
blocks such that, within each block, each exit state can be reached eventually (with
probability one) from any entry state using only non-exit actions. The blocks of a
HEXQ partition will be referred to as regions.
Figures 4.2 to 4.4 give examples of exits and HEXQ partitions. Chapter 5 will
describe how the HEXQ algorithm, using strongly connected components of the
underlying state transition graph, can form these regions. The objective at this
stage is to verify that the regions and exits are consistent with their definitions. In
the figures the circles represent base level states. The squares with rounded corners
represent the regions of a partition. Each instance of a state is given by subscripted
pairs of x and y variable values, eg (x3, y1). Transitions between states are indicated
by arrows, labelled with associated actions a, b or c, joining the base level states.
Figure 4.2 shows a stochastic transition from state (x3, y0) on action b. The next
state may be (x2, y1) where the y variable has changed value. The pair ((x3, y0), b) is
therefore an exit because it is associated with a variant transition. By condition 2 of
variant transitions, ((x3, y1), b) is therefore also an exit. States (x1, y0) and (x2, y1)
are both entry states. It is not possible to reach exit state (x3, y1) from entry
state (x0, y0) with non-exit actions, necessitating the two regions. Because exits are
generated when the y variables change value, the coarsest possible partition is one
Figure 4.2: For transition (x3, y0) →b (x2, y1) the y variable changes value. As this is a variant transition, ((x3, y0), b) is an exit. If all states were in the one region, then entry state (x0, y0) cannot reach exit state (x3, y1) using non-exit actions. Two regions are therefore necessary to meet the HEXQ partition requirements.
in which each region is associated with one of the y variable values, creating |Y| regions.
Figure 4.3 illustrates an example where multiple regions are required even though
all base level states share the same y variable value. Multiple regions are necessary
in this case to ensure each entry state can reach each exit state. The exit ((x3, y0), b)
is created as the underlying transition is inter-block.
Figure 4.4 shows an exit created as a result of condition 2 of variant transitions.
Here the reward values for transitions (x0, y0) →b (x1, y0) and (x0, y1) →b (x1, y1)
differ. Exits are similarly created if the transition probabilities vary between the
same x variable values for different y variable values.
In formulating an MDP description of the rooms problem in figure 4.1, the state
space can be described in a variety of ways. Figure 4.5 shows two alternatives. In
(a) the states are described by the variables, x =position-in-room and y =room-
number. This could correspond to two sensors of a robot, one that observes its
location relative to walls and the other the room colour. In (c), the states are
Figure 4.3: In this example all states have the same y value. If all states are in the one region, then entry state (x5, y0) cannot reach exit state (x3, y0). Therefore, two regions are necessary to meet the HEXQ partition requirements. The inter-block transition means that ((x3, y0), b) is an exit.
Figure 4.4: The transitions (x0, y0) →b (x1, y0) and (x0, y1) →b (x1, y1) have different associated rewards and hence give rise to exits ((x0, y0), b) and ((x0, y1), b) by condition 2 of variant transitions.
described by their x and y coordinates as may be the case if the robot has a GPS
like sensor. Figure 4.5 (b) and (d) show the respective HEXQ partitions of regions
for each of the two state space descriptions in (a) and (c). The three regions in
(b) are the rooms. If the position-in-room variable values differ for each room then
HEXQ may not find any useful decomposition to allow generalisation. Each region
in the partition may then be a single state in which case there is no benefit to the
decomposition, underlining the importance of representation.
Figure 4.5: HEXQ partitioning of the maze in figure 4.1. The state representation affects the partitioning. In (a) the agent uses a position-in-room and room sensor, resulting in the three regions shown in (b). In (c) the agent uses a coordinate-like sensor that partitions the state space into the 15 regions shown in (d).
In the second case, (c), the HEXQ partition results in 15 regions as shown in
figure 4.5 (d). In this case every state is an exit state, because a North or South
action for any x variable value may change the y variable value. The regions are
divided into two ranges for the x values, one from 0 to 4 and one from 5 to 9. The
subdivision of regions into these two ranges is created because of variant transitions.
For example, a transition from state (4, 8) to (5, 8) is not possible, but the transition
from (4,7) to (5,7) is possible. If similar x values on action East do not have the
same state transition function for all y values, exits are created. The 10 exits are
((4, ·), East) with the y variable value ranging from 0 to 9. These exits prevent some
entry states reaching exit states requiring the two regions for the same y variable
value. Moving between states (3, y) and (4, y) with similar actions is possible with
reward −1 for all y values. These transitions do not give rise to exits.
The vector of variables used to describe a multi-dimensional MDP has no reason
to order the variables in any specific way. The foregoing decomposition conditions
assumed regions to be formed with respect to the x variable and in the context of
the y variable. For any particular problem, the assignment of the x and y variables
in the multi-dimensional state vector may be interchanged for the purposes of de-
composition. This may result in a different partition. Chapter 5 will show how the
variables are heuristically ordered in practice to find good partitions, but the HEXQ
partitioning process does not require a particular order for the variables per se.
4.2.2 Sub-MDPs
Given a HEXQ partition, it is possible to construct sub-MDPs using the states in
each region and the transition and reward functions of the original MDP.
A smaller stochastic shortest path MDP can be constructed by taking a region
of a HEXQ partition, the non-exit transition and reward functions of the original
MDP, modelling one exit of the region as a zero reward transition to an absorbing
state and making the value function the undiscounted sum of future rewards.
This is a valid sub-MDP as the transition and reward functions are Markov
by virtue of being inherited from the original MDP. Each exit state of the region is
proper because it can be reached from all entry states as a requirement of the HEXQ
partition. The construction therefore satisfies the requirements for a stochastic
shortest path MDP.
In this way it is possible to construct one MDP for each exit of every region
in the HEXQ partition of the original MDP. To distinguish these MDPs from the
original MDP they are referred to as sub-MDPs. It is important to note that a
policy can only leave the region via the specified exit because other exit actions are
disallowed by construction of the sub-MDP.
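The construction can be summarised in a short Python sketch (the table layouts are assumptions; this illustrates the construction just described rather than the implementation of Chapter 5): one sub-MDP is generated per exit of a region, the designated exit is redirected to a zero-reward absorbing state, and all other exit actions are removed.

    ABSORBING = "absorbing"   # illustrative label for the added absorbing state

    def build_sub_mdp(region_states, region_exits, chosen_exit, actions, T, R):
        """region_exits : set of (state, action) exits of this region
        chosen_exit  : the single exit this sub-MDP is permitted to use
        T[(s, a)]    : list of (s2, probability) pairs from the original MDP
        R[(s, a, s2)]: primitive reward from the original MDP"""
        sub_T, sub_R = {}, {}
        for s in region_states:
            for a in actions:
                if (s, a) == chosen_exit:
                    # The designated exit becomes a zero-reward transition to an
                    # absorbing state, giving a stochastic shortest path sub-MDP.
                    sub_T[(s, a)] = [(ABSORBING, 1.0)]
                    sub_R[(s, a, ABSORBING)] = 0.0
                elif (s, a) in region_exits or (s, a) not in T:
                    continue                      # other exit actions are disallowed
                else:
                    sub_T[(s, a)] = T[(s, a)]     # non-exit dynamics are inherited
                    for s2, _prob in T[(s, a)]:
                        sub_R[(s, a, s2)] = R[(s, a, s2)]
        return sub_T, sub_R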
4.2.3 Top level semi-MDP
A semi-MDP can be constructed to find a proper policy for the original MDP. A
semi-MDP m is defined from the states of the original MDP and abstract actions
that are proper policies for the sub-MDPs constructed in section 4.2.2.
To see that this forms a semi-MDP consider the value of state s. By definition it
is the expected sum of future primitive rewards for a policy π(s) that by construc-
tion invokes sub-MDP policies (abstract actions) and follows them until their exit,
whereupon another sub-MDP is invoked.
$$V^\pi_m(s) = E\{r_1 + r_2 + r_3 + \ldots\} \qquad (4.1)$$
Let N be the random number of single step transitions, starting in state s,
executing abstract action a and terminating in state s′ (s →a s′). The value function
can then be written as the expectation of the sum of two series.
$$V^\pi_m(s) = E\left\{\sum_{n=1}^{N-1} r_n + \sum_{n=N}^{\infty} r_n\right\} \qquad (4.2)$$
The first series is the expected sum of the rewards to termination of the abstract
action in state s′ including the reward on exit and the second is the value of the
state s′ reached following termination.
$$V^\pi_m(s) = \sum_{s',N} T^{Na}_{ss'}\left[R^{Na}_{ss'} + V^\pi_m(s')\right] \qquad (4.3)$$
This defines a semi-MDP as equation 4.3 has the form of a Bellman equation for
a semi-MDP (see equation 2.17).
The HEXQ partitioning and construction of the sub-MDPs will be called a HEXQ
decomposition. While the treatment here is similar to that of other researchers,
(Dean and Lin (1995), Parr (1998), Hauskrecht et al. (1998), Dietterich (2000)) a
key difference in the HEXQ formulation of the semi-MDP is that abstract actions
terminate on exit execution whether the region is left or not. An aggregate state
(or region) can therefore transition to itself.
Region policies and termination after exit execution are a specific form of options
(Sutton et al., 1999a) in which the probability of termination is 1 when executing
an exit and 0 otherwise.
Solving this semi-MDP provides a solution to the original MDP constrained by
the abstract actions. As for HAMs, MAXQ and options, this policy is in general
non-stationary and sub-optimal. It is non-stationary because the primitive actions
taken in some states may depend on which abstract action was executed previously
from an entry state. It may be sub-optimal because sub-MDP policies have been
restricted to those producing proper single exits. Section 4.3 explores the optimality
of HEXQ decompositions in greater detail.
To illustrate the generated hierarchical structure for the simple maze with the
state space and partition shown in figure 4.5 (a) and (b), the top semi-MDP and
12 sub-MDPs for the regions are shown in figure 4.6. Each region has four exits
and therefore generates four sub-MDPs. All states are entry states as they are
all possible initial states. The exit states in the top semi-MDP are highlighted by
shading. Starting in any state, say (3, 0), there are four abstract actions that can be
chosen corresponding to the policies of the sub-MDPs. For state (3, 0) the relevant
sub-MDPs are those leftmost in figure 4.6. There is one sub-MDP and policy to leave
Figure 4.6: The HEXQ tree for the simple maze showing the top level semi-MDP and the 12 sub-MDPs, 4 for each region. The numbers shown for the sub-MDPs are the position-in-room variable values.
the room to the North, East, South and West. The agent will learn that choosing
the abstract action that leaves the room to the East is optimal when starting in state
(3, 0) as this is the shortest way to the goal. When this abstract action terminates
in state (10, 1), another abstract action is chosen, this time the policy corresponding
to the top right sub-MDP in figure 4.6 which will lead the agent directly to the goal.
It is evident from figure 4.6 that some of the sub-MDPs are identical. HEXQ
will merge some of the leaves of this tree to form a more compact directed acyclic
graph representation to be described later in this Chapter. In the next section the
HEXQ decomposition is generalized to higher dimensional MDPs.
4.2.4 Higher Dimensional MDPs
The HEXQ decomposition of MDPs with dimension greater than 2, say d, is based
on the recursive application of the above decomposition with two state variables.
The variables in the state vector are grouped into two sets $\{s_1, s_2, \ldots, s_{d-1}\}$ and
$\{s_d\}$. A new state vector is defined with two state variables, the first being the
Cartesian product of the $d-1$ variables. Hence $s = (s_1 \times s_2 \times \ldots \times s_{d-1}, s_d)$.
The HEXQ decomposition is applied as in section 4.2.3. Each sub-MDP is now a
stochastic shortest path MDP in its own right, following the constructions in section
4.2.2, with a state vector of effectively $d-1$ variables as the variable $s_d$ is constant for
each sub-MDP. Each sub-MDP can be further HEXQ decomposed by formulating
an equivalent MDP with the two state variables $s = (s_1 \times s_2 \times \ldots \times s_{d-2}, s_{d-1})$. In
this way the original MDP is recursively decomposed, one variable at a time. The
recursive decomposition halts when the sub-MDPs have been reduced to one state
variable. A tree (HEXQ tree) of sub-MDP nodes is generated in this way with a
depth of $d-1$. The levels in the tree are labelled with consecutive integers from the
leaf node sub-MDPs (level 1) to the root node sub-MDP (level $d$).
All the sub-MDPs in the tree are semi-MDPs except for the leaf node sub-MDPs5.
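As a purely illustrative aside, the recursive grouping can be sketched as follows (a minimal Python sketch; the tuple representation of a state and the helper names split_state and levels are hypothetical, not part of HEXQ itself):

    # A minimal sketch of the recursive variable grouping: at each level the
    # state vector (x1, ..., xd) is treated as the pair (x1 x ... x x_{d-1}, xd).
    # Names and the tuple representation are illustrative only.

    def split_state(state):
        # Split (x1, ..., xd) into the compound lower part and the context variable xd.
        *lower, context = state
        return tuple(lower), context

    def levels(state):
        # List the (compound, context) pairs produced by recursive splitting,
        # from the top level down to a single remaining variable.
        result = []
        while len(state) > 1:
            lower, context = split_state(state)
            result.append((lower, context))
            state = lower
        return result

    if __name__ == "__main__":
        # A 3-variable state such as (location-in-room, room, floor).
        print(levels((7, 2, 1)))   # [((7, 2), 1), ((7,), 2)]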
4.2.5 Value Function Decomposition with HEXQ Trees
The aim of this subsection is to derive the equations for the recursive representation
of the value functions of a state s in sub-MDP m in the HEXQ tree executing a
hierarchical policy. The value function decomposition was motivated by Dietterich
(2000) who in turn built on previous work by Singh (1992), Kaelbling (1993), Dayan
and Hinton (1992), Dean and Lin (1995).
5 All the HEXQ results for stochastic shortest path MDPs as defined here apply just as well if the leaf node sub-MDPs were semi-MDPs. As the discount factor is 1, the value function and optimum policy do not depend on the timing of base-level state transitions.
Definition 4.3 A hierarchical policy is a set of proper policies, one for each
sub-MDP in the HEXQ tree. So for sub-MDPs 1, 2, . . . , n hierarchical policy π =
{π1, π2, . . . , πn}.
Figure 4.7: An example trajectory under policy π, for N = 4 steps, where sub-MDP m invokes sub-MDP ma using abstract action a, showing the sum of primitive rewards to the exit state sa and the primitive reward on exit.
For sub-MDP m, assume that the current abstract action, π(s) = a, has invoked
sub-MDP ma and will terminate in state s′, after a random number of N time-steps,
for a given hierarchical policy π. The reward on exit of sub-MDPs is not included
in the value function by construction. Therefore, to use the values of the states
from sub-MDP ma in equation 4.2 the reward to exit is split into the cumulative
reward $R^{N-1,a}_{s s_a}$ to reach the exit state $s_a$ and the expected primitive reward on exit $R^a_{s_a s'}$. The notation $V^\pi_m(s')$ refers to the value of state $s'$ in sub-MDP $m$ where all
sub-MDPs invoked by m follow the hierarchical policy π. Figure 4.7 illustrates the
components of the total reward for transitioning from state s through to state s′ as
a result of taking abstract action a that invokes sub-MDP ma.
$$V^\pi_m(s) = \sum_{s',N} T^{N,a}_{s s'} \left[ R^{N-1,a}_{s s_a} + R^a_{s_a s'} + V^\pi_m(s') \right] \qquad (4.4)$$
Theorem 4.4 For any proper sub-MDP policy in the HEXQ tree the next state on
exit and the expected reward on exit are independent of the entry state.
Proof For a region in a HEXQ partition, any exit state can be reached from
every entry state using non-exit actions, by definition. By construction, any partic-
ular sub-MDP defined over a region has all exits disallowed except for one that is
modelled as an absorbing state (subsection 4.2.2). A proper sub-MDP policy must
therefore use this exit independently of the entry state. The independence of the
exit makes the reward on exit and the next state independent of the entry state
because of the Markov property of the transition and reward functions.
The first term on the RHS of equation 4.4 is the value of state s in sub-MDP
ma as it represents the expected sum of rewards to the exit state of sub-MDP ma
for abstract action a. Because discount factor γ = 1, γ does not appear in these
equations. As rewards on exit and the next state s′ are independent of N , the
expectation can be taken with respect to the exit reward and state s′ alone. Hence
$$V^\pi_m(s) = V^\pi_{m_a}(s) + \sum_{s'} T^a_{s_a s'} \left[ R^a_{s_a s'} + V^\pi_m(s') \right] \qquad (4.5)$$
where $T^a_{s_a s'}$ is the probability of transitioning to state $s'$ after abstract action $a$ terminates and $R^a_{s_a s'}$ is the expected final primitive reward on transition to state $s'$
when abstract action a terminates. Equation 4.5 decomposes the value function for
a state recursively.
Definition 4.5 The HEXQ action-value function E (or exit value for short)6
of state s for abstract action, a, in sub-MDP m, is the expected value of future rewards after completing the execution of the abstract action, a, starting in state s and following the hierarchical policy π thereafter. Note that E includes the expected primitive reward on exit, but does not include any rewards accumulated while executing the sub-MDP associated with a.

6 While the HEXQ function E is similar to Dietterich's completion function there are important differences. In particular, the inclusion of the primitive reward on exit in the HEXQ function will be seen to have advantages, such as greater compaction of the value function representation, the automatic generation of the decomposition hierarchy and automatic hierarchical credit assignment.
$$E^\pi_m(s, a) = \sum_{s'} T^a_{s_a s'} \left[ R^a_{s_a s'} + V^\pi_m(s') \right] \qquad (4.6)$$
Substituting equation 4.6 into equation 4.5 gives
$$V^\pi_m(s) = V^\pi_{m_a}(s) + E^\pi_m(s, a). \qquad (4.7)$$
For a 1-dimensional MDP and for all level 1 HEXQ sub-MDPs of a multi-
dimensional MDP, the HEXQ function E is identical to the normal Q function.
In this case the value of the term $V^\pi_{m_a}$ in equation 4.7 is zero (since there are no internal transitions in a primitive state) and $V^\pi_m(s) = E^\pi_m(s, a) = Q^\pi_m(s, a)$ where $a = \pi(s)$.
The HEXQ function E can be interpreted as a hierarchical generalisation of the Q
function.
Given a HEXQ tree of depth $d-1$ and a hierarchical policy $\pi$ which invokes a nested set of abstract actions $a_d, a_{d-1}, \ldots, a_1$ with corresponding sub-MDPs $d, d-1, \ldots, 1$, the value of state $s$ is decomposed as

$$V^\pi_d(s) = E^\pi_1(s, a_1) + E^\pi_2(s, a_2) + \ldots + E^\pi_d(s, a_d). \qquad (4.8)$$
Figure 4.8 shows an example of how the value of state (3, 0) is composed from the
E values in the sub-MDP for the top left room plus the E value in the top sub-MDP.
It is assumed that the hierarchical policy at the top level invokes the abstract action
that moves the agent out of the East doorway and, for the top left room sub-MDP,
the next action is down. As the reward is −1 for each step, the E value to the
doorway is −3 and the E value for the abstract action to exit and then to reach the
goal is −5.

$$V^\pi_2((3,0)) = E^\pi_1((3,0), \text{south}) + E^\pi_2((3,0), \text{leave room east}) = -3 - 5 = -8$$

Figure 4.8: The value of state (3, 0) is composed of two HEXQ E values.
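To make the additive decomposition concrete, the following is a minimal sketch (a hypothetical helper hexq_value and hand-coded toy tables, not the thesis implementation) that evaluates a state by summing the stored E values along the levels of a HEXQ tree for a fixed hierarchical policy:

    # A minimal sketch of hierarchical value evaluation (equation 4.8): the value
    # of a state is the sum of the exit values E of the abstract actions chosen by
    # the hierarchical policy at each level.  Data layout is illustrative.

    def hexq_value(state, policy, E, top_level):
        total = 0.0
        for e in range(1, top_level + 1):
            a = policy[e][state]          # abstract action chosen at level e
            total += E[e][(state, a)]     # exit value stored for that choice
        return total

    if __name__ == "__main__":
        # Toy numbers mirroring the worked example for state (3, 0):
        # E at level 1 (reach the east doorway) = -3,
        # E at level 2 (exit east and then reach the goal) = -5.
        s = (3, 0)
        policy = {1: {s: "south"}, 2: {s: "leave-room-east"}}
        E = {1: {(s, "south"): -3.0}, 2: {(s, "leave-room-east"): -5.0}}
        print(hexq_value(s, policy, E, top_level=2))   # -8.0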
This section has illustrated how a multi-dimensional MDP is HEXQ decomposed
into a tree of sub-MDPs. The value of any state is the sum of exit values of the
sub-MDPs invoked along a path from the root to a leaf node of the HEXQ tree.
4.3 Optimality of HEXQ Trees
The question arises as to whether decomposed MDPs can provide an optimal so-
lution to the original MDP. This section will show that a HEXQ decomposition is
hierarchically optimal in general and globally optimal under certain conditions.
The solution to a hierarchically decomposed MDP can be classified by different
kinds of optimality. It is of course desirable for the hierarchically decomposed MDP
to achieve the globally optimal solution to the original MDP. As far as is known,
there is no general method that can solve all MDPs both with reduced computational
complexity and optimally by decomposition. If it is not possible to achieve an
optimal solution then the next best outcome is to guarantee that a solution is ε-
optimal. This means that, for all states, the value function can be guaranteed to
be close to optimal. Parr (1998) suggests generating an ε-optimal policy cache over
the regions of a decomposition. A policy cache is ε-optimal if for any set of values
of termination states of a region, a policy can be found such that after one value
iteration, the value of any state in the region does not change by more than ε. It is
possible to trade off optimality against the number of required policies in the cache.
However, unless the regions are very loosely connected or ε is very large, the number
of policies necessary to construct the cache may be prohibitive.
Imposing a hierarchy and hierarchical execution usually constrains the policies in
some way that prevents global optimality guarantees. It is then desirable to achieve
the best value for all states consistent with the constraints of the hierarchy. A policy
that meets these criteria is called a hierarchically optimal policy. It is easy to see
that a hierarchically optimal policy is not necessarily globally optimal. For example,
when sub-MDPs in the hierarchy are invoked they will execute until termination,
even though, due to stochastic actions, it may be better to switch to a different
strategy.
For MAXQ, in which sub-MDPs are reused, Dietterich (2000) defines a weaker
form of optimality called recursive optimality. A recursively optimal policy is one
in which each sub-MDP is solved with defined values for its termination states
(Parr calls them out-states) without considering the context of the sub-MDP in the
overall problem. In MAXQ, for stochastic shortest path problems, Dietterich (2000)
equates these values to zero for all desirable (goal) terminations and an arbitrary
large negative number for undesirable terminations. Dietterich then reuses the policy
learnt for these termination values in other situations where the same goal states
are desired, but where they may have different values. While this has the advantage
of reuse, MAXQ sub-task policies may not be optimal in their different contexts.
MAXQ is therefore not necessarily hierarchically optimal, meaning that the overall
solution may not be the best one given just the constraints of the hierarchy.

Figure 4.9: A region with two exits, where the HEXQ decomposition misses a potentially low cost exit policy from the region.
The solution of a HEXQ decomposition can be demonstrated to be arbitrarily worse than a globally optimal policy, as the example in figures 4.9 and 4.10 shows.
Figure 4.9 illustrates a region of three states, s1, s2 and s3 together with the tran-
sitions and rewards between them. The vertical line indicates the region boundary.
There are two possible exits, one from state s2 and the other from state s3, that lead
out of the region on action b. The entry state is assumed to be s1. The actions are
deterministic, except for action a from state s1 that has a 50% probability of moving
to either state s2 or s3. The generated HEXQ sub-MDPs for this region will provide
one policy for leaving the region via each exit, but not both as shown in figure 4.10
(a) and (b). The optimum value of state s1 is −10 taking optimum action b or c
depending on the exit goal. While action a may reach the exit state with less cost,
there is a 50% risk that the wrong exit state is reached and a heavy penalty is in-
curred (a reward of −100) to recover, i.e. return to state s1. If the globally optimal
policy did not favor either exit, then a policy that includes taking action a in state
s1 improves the value of state s1 from -10 to -1. This policy, shown in figure 4.10
(c), is unavailable to HEXQ by construction and therefore HEXQ may not produce
a globally optimal solution. Depending on the magnitude of the rewards in figure 4.9 the HEXQ value function may be arbitrarily worse than optimal.

Figure 4.10: For the region in figure 4.9 the optimal policies for the two sub-MDPs created by HEXQ (one for each exit) are shown in (a) and (b). The optimal policy to use either exit, shown in (c), is not available to HEXQ by construction.
Parr (1998) makes the same point about HAMQ learning which he proves to be
hierarchically optimal. An MDP combined with a HAM machine will not necessarily
find the optimal solution to the MDP. While the solution will be consistent with the
constraints imposed by the underlying machine specification, there is no guarantee
that the resultant HAM will produce a solution that is close to optimal (Parr, 2002).
Hierarchical execution means that sub-MDPs are executed until termination.
Hauskrecht et al. (1998) point out that in this case a sub-MDP will stubbornly
attempt to exit a region by the exit determined from the level above, even though
the agent may have slipped closer to another exit that is now more optimal. Di-
etterich generalises Kaelbling’s (1993) ideas showing that a possible improvement
to a recursively optimal solution is to execute it non-hierarchically and re-evaluate
the optimal policy at every level after every step. Dietterich refers to this as a hi-
erarchical greedy policy and shows that it is no worse than the recursively optimal
policy. However, even a hierarchical greedy policy cannot provide any optimality
guarantees. The example in figures 4.9 and 4.10 makes the point. The globally
optimal policy may be to take action a from s1, however, the HEXQ tree does not
contain this possibility as all sub-MDPs have been optimized to terminate in only
one exit.
Theorem 4.6 The recursively optimal solution of a HEXQ decomposition is hier-
archically optimal.
Proof All sub-MDPs below the top level have only one exit state and the discount
factor is 1. Therefore an optimal policy for each sub-MDP is optimal in all con-
texts as the value of the state reached after exit cannot affect the optimal policy
for the sub-MDP. The top level sub-MDP does not have a context (there are no
other higher level variables) and is solved with the unique termination values from
the original MDP. Therefore, a recursively optimal solution is the best that can be
achieved given the hierarchical constraints and is therefore hierarchically optimal.
The hierarchically optimal value and exit value of state s in a HEXQ decompo-
sition follow directly from equations 2.18, 4.7 and 4.6 as follows:
$$V^*_m(s) = \max_a \left[ V^*_{m_a}(s) + E^*_m(s, a) \right] \qquad (4.9)$$

$$E^*_m(s, a) = \sum_{s'} T^a_{s_a s'} \left[ R^a_{s_a s'} + V^*_m(s') \right] \qquad (4.10)$$
4.3.1 Globally Optimal HEXQ
The solution to a HEXQ decomposed MDP can be shown to be globally optimal in
some cases.
Parr (1998) introduced the concept of a policy cache being ε-optimal. Knowing
that a policy cache is ε-optimal provides a guarantee that the global solution is
within a set bound of optimal. In particular, a corollary from Parr’s theorem 16
(Parr, 1998) is that if the policy caches for all regions of an MDP are 0-optimal, then
the optimal policy for the semi-MDP, using the policies from the caches as abstract
actions, is globally optimal.
Lemma 4.7 For a deterministic shortest path MDP (one with deterministic ac-
tions) the optimal policies for the sub-MDPs defined over HEXQ regions can be used
to implicitly generate 0-optimal policy caches for each region for any set of termi-
nation state values.
Proof By construction, each region has an associated set of sub-MDPs, one leading
to each region exit state. Define the termination states $s'$ as the states reached following the execution of an exit, with associated values $V^o(s')$. Assume an arbitrary set of value function values
for the termination states. The optimal value of state s is given by
$$V^*(s) = \max_\pi E\left\{ \sum_{n=1}^{\text{to exit}} r_n + V^o(s') \right\}. \qquad (4.11)$$
Since the actions are deterministic an optimal policy for the MDP will determine
the region’s exit. The search for the optimum value for state s can therefore be
written as a search over all possible exits leading to termination states. Furthermore,
the optimum value to the exit is given by one of the HEXQ decomposition sub-MDPs
values for state s by construction. Hence
$$V^*(s) = \max_{i \in \text{Exits}} \left\{ V^*_i(s) + R_i + V^o(s_i) \right\} \qquad (4.12)$$

where $V^*_i(s)$ is the optimum expected value of state $s$ in the sub-MDP leading to exit $i$. Reward $R_i$ is the expected reward on exiting via exit $i$ and $V^o(s_i)$ is the value
of the termination state following exit.
Therefore, given a set of termination state values, the optimal abstract action to
take in state s is the optimal policy for the sub-MDP with exit i that maximises the
argument in the above equation. The best action can be determined for every state
in the region resulting in a policy that is 0-optimal for a particular set of termination
state values. In this way a 0-optimal policy cache can be constructed for any set of
out state values.
The hierarchical execution of a HEXQ decomposition for a deterministic shortest
path problem implicitly executes a 0-optimal policy cache since equation 4.12 has
the same form as equation 4.9.
Theorem 4.8 Hierarchical execution of a HEXQ decomposed deterministic shortest
path MDP is globally optimal.
Proof Proceeding by induction, if level i sub-MDPs are optimal they implicitly
generate a 0-optimal policy cache for the semi-MDP at level i + 1 by the above
lemma. By Parr’s result the MDP represented by the semi-MDP at level i + 1 is
therefore optimal. A leaf node sub-MDP has only base level states and primitive
actions and is therefore optimal.
Corollary 4.9 The hierarchical execution of a HEXQ decomposed stochastic short-
est path problem is globally optimal if the stochasticity is present at the top level
only.
Proof It is only necessary to note that for a semi-MDP to produce a globally
optimal solution for the original MDP by Parr’s theorem, it is not required to be
deterministic, just that each region of the original MDP needs to have a 0-optimal
policy cache. Since all sub-MDPs below the top level are deterministic, all regions
have a 0-optimal policy cache by lemma 4.7.
This section has shown that the HEXQ decomposition is hierarchically optimal,
but that no globally optimal guarantees can be made for HEXQ in general. For
decompositions that only have stochastic sub-MDPs at the top level and determin-
istic actions for sub-MDPs at all other levels, HEXQ has been proven to provide a
globally optimal solution.
4.4 Representing HEXQ trees compactly
A major benefit of a good HEXQ decomposition is the significant savings in com-
putation time and storage at each level of the hierarchy. This section explains the
opportunities for the compact representation of a HEXQ decomposed MDP by ab-
stracting states. The approach is to show compaction for a 2-dimensional MDP and
then to recursively generalise the results for higher dimensions.
4.4.1 Markov Equivalent Regions (MERs)
Take an MDP with state s = (x, y). A HEXQ partition is likely to contain many
similar regions. The HEXQ decomposition ensures that each leaf sub-MDP in a
HEXQ tree is independent of the y variable and that intra-region non-exit transitions
between similar x states are identical. Many regions are in a sense equivalent and
the equivalence will be used to great advantage to compact the state space.
To describe this equivalence formally, a binary relation7 B on HEXQ partition
G = {g} is defined as follows:
$$g_i \, B \, g_j \text{ if and only if } x = x' \text{ for some } s = (x, y) \in g_i, \; s' = (x', y') \in g_j \qquad (4.13)$$
The binary relation B is an equivalence relation8. The equivalence classes of B on
G, [gi], will be referred to as Markov equivalent regions (MERs) to highlight the property that non-exit state and reward transitions between x values in equivalent regions are the same for all y values.

7 See Appendix A.
8 See Appendix A.

Figure 4.11: The maze HEXQ graph with sub-MDPs represented compactly.
It follows that all sub-MDPs in a HEXQ tree whose exit states have the same x variable value are identical. This means that a Markov equivalent region exit policy
can be reused for all y variable values. Because of this equivalence, a HEXQ tree
is represented compactly by redirecting edges to similar sub-MDPs at each level,
thereby converting the tree into a DAG, called a HEXQ Graph.
The savings in storage depends on the number of elements in each of the Markov
equivalent regions. The HEXQ tree for the maze problem in figure 4.6 can be
compactly represented as shown in figure 4.11 by eliminating 8 of the sub-MDPs in
figure 4.6 as they are clearly duplicates. In this example there are 3 region elements
in the one room Markov equivalent region class and the storage is reduced to one
third for the leaf node sub-MDPs.
Since all Markov equivalent region actions are hypothesized to be available for
each region instance the agent will explore them as abstract actions in the top-level
semi-MDP. For example, starting in state (3, 0) exploratory exit attempts will be
made at (state (2, 0), action North), ((10, 0), West), ((22, 0), South) and ((14, 0),
East). Of course the first two exits will be found to be sub-optimal for all starting
states in the room as the exits simply return the agent to the respective exit state
with a reward of −1.
4.4.2 State Abstracting Markov Equivalent Regions
Allowing for the possibility of starting in any base-level state, the top level semi-
MDP would still require as many E values as the original MDP would require Q
values. This is in addition to values stored to represent the abstract actions as
sub-MDP policies.
It is however possible to compactly represent the semi-MDP E table for any
fixed policy over the HEXQ Graph. From theorem 4.4 the next state on exit and
the expected reward on exit are independent of the entry state. This independence
means that the HEXQ action value function is not dependent on any particular state
s ∈ g and can be written as
$$E^\pi_m(g, a) = \sum_{s'} T^a_{g s'} \left[ R^a_{g s'} + V^\pi_m(s') \right] \qquad (4.14)$$
where $g$ is the aggregate state containing state $s$, $T^a_{g s'}$ is the probability of exiting in $s'$ after taking action $a$ from any state in $g$, and $R^a_{g s'}$ is the expected reward for this
exit transition. It is now only necessary to store the HEXQ function Eπm(g, a) once
instead of for all s ∈ g. The number of aggregate states |G| is less than the number
of total states |S| by a factor of |S|/|G| and represents the storage savings factor for
this type of state abstraction at each level.
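The following minimal sketch (the dictionary layout and the region_of helper are assumptions for illustration) shows the storage saving: the exit value table is keyed by the aggregate region rather than by every state inside it:

    # A minimal sketch of the E-table compaction behind equation 4.14: because
    # the exit transition is independent of the entry state, one entry per
    # (region, abstract action) replaces one entry per (state, abstract action).

    E_per_state = {}    # keyed by (state, abstract_action)  -- before compaction
    E_per_region = {}   # keyed by (region, abstract_action) -- after compaction

    def region_of(state):
        # Map a state to its aggregate region; here simply the higher level
        # variable (e.g. the room number) of a (position-in-room, room) state.
        return state[1]

    def store_exit_value(state, action, value):
        E_per_state[(state, action)] = value
        E_per_region[(region_of(state), action)] = value

    if __name__ == "__main__":
        for pos in range(25):
            store_exit_value((pos, 0), "leave-room-east", -5.0)
        print(len(E_per_state), len(E_per_region))   # 25 1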
For the HEXQ graph in figure 4.6, abstracting whole Markov equivalent regions to single states makes it possible to collapse the 75 states in the top level semi-MDP into just three abstract states as depicted in figure 4.12.

Figure 4.12: The simple maze HEXQ graph with top level sub-MDP abstracted.
4.4.3 Compaction of Higher Dimensional MDPs
State abstraction for 2-dimensional MDPs can be recursively applied for d-dimensional
MDPs.
The approach is similar to that used for generating HEXQ partitions in section
4.2.4. The compaction is first applied to an equivalent two state MDP where the
state is represented as s = (s1× s2× . . .× sd−1, sd). Each sub-MDP in the resultant
HEXQ graph has one less variable in its state description and can be described with
the two state variables s = (s1 × s2 × . . . × sd−2, sd−1). MERs are now found at
level d− 1 i.e. for projected states x = s1 × s2 × . . .× sd−2. Common policies over
the MERs from all sub-MDPs are stored only once. MERs are state abstracted
and represented by single higher level states in each sub-MDP at level d − 1. This
process is recursively applied down the HEXQ tree. The end result is a compact
HEXQ graph.

Figure 4.13: The multi-floor maze.
To illustrate the compaction of a higher dimensional MDP the maze example is
extended in figure 4.13 to include another floor of rooms. A first floor with four
identically sized rooms is added and linked by a lift. Exiting the north-east room to
the east on each floor transports the agent to the same room location on the other
floor. The rooms are identified in the same manner on each floor. A position in this
“house” is specified by the three state variables; floor, room and location-in-room.
Figures 4.14 and 4.15 show the HEXQ tree and compact HEXQ graph respectively.
Decomposition equations 4.9 and 4.10 can be restated in compact form as follows:
$$V^*_m(s) = \max_a \left[ V^*_{m_a}(s) + E^*_m(g, a) \right] \qquad (4.15)$$

$$E^*_m(g, a) = \sum_{s'} T^a_{g s'} \left[ R^a_{g s'} + V^*_m(s') \right] \qquad (4.16)$$
They give the compact decomposed hierarchically optimal value function for state $s$ in sub-MDP $m$ at level $e$, where (abstract) action $a$ invokes sub-MDP $m_a$ defined over region $g$. At the lowest level $V^*_{m_a} = 0$, as all primitive actions are exit actions and do not invoke lower level routines.

Figure 4.14: The HEXQ tree of sub-MDPs generated from the multi-floor maze.

Figure 4.15: The HEXQ directed acyclic graph of sub-MDPs derived from the HEXQ tree.
The implicit state abstraction from s to g can be seen as an application of
Dietterich’s result distribution irrelevance condition for permitting state abstraction.
The Markov Equivalent Region compact representation in HEXQ is related to Max
node irrelevance and leaf irrelevance in MAXQ and is similar to model minimisation
(Dean and Givan, 1997) over the MER sub-spaces.
Ravindran and Barto (2002, 2003) have defined homomorphic mappings of sym-
metric sub-MDPs. This would allow the 4 sub-MDPs in figure 4.12 to be compacted
further into one “relativised” option, in their terminology. Automating symmetry
discovery may be a useful future research direction.
In summary, this Chapter has shown how a stochastic shortest path MDP can
be HEXQ decomposed and represented compactly by abstracting base-level states
and primitive actions. The decomposition results in a multi-level hierarchy of sub-
MDPs that, in general, are all semi-MDPs, constrained to use abstract actions that
are fixed policies for exiting sub-MDP regions. At each level of the HEXQ graph
there is substantial opportunity for compaction, as all the sub-MDPs are formulated
to be independent of variables associated with higher levels. In addition, termination
of each sub-MDP is independent of how the sub-MDP is entered and hence, the value
following exit can be represented using just one table entry for a whole region of
aggregate states.
The next Chapter will describe the HEXQ algorithm, automating the decompo-
sition and solution process of an MDP in practice.
Chapter 5
Automatic Decomposition: The
HEXQ Algorithm
This Chapter describes the basic hierarchical reinforcement learning algorithm HEXQ.
It is assumed that a multi-dimensional MDP from the class of stochastic shortest
path problems has been provided. The objective is to decompose and solve the
problem automatically when the model (the state transition and reward functions)
of the original MDP is not provided to the algorithm. The hope is to solve the
original problem more efficiently, that is, with less storage requirements and in less
time.
The steps of the HEXQ algorithm are outlined in table 5.11 based on the theory
from the previous Chapter. In contrast to the treatment in the theory, the algorithm
builds the HEXQ graph from the bottom up starting with the leaf sub-MDPs. The
rest of this section gives a broad overview of the algorithm using the maze example.
The following sections will then give a more detailed account.
The choice of representation of the original problem plays a significant role in the
decomposition. For the simple maze example in the top of figure 5.1 the state space
is assumed to be described by the two variables: position-in-room and room-number.
1 The notation to describe the state vector for the original MDP has been changed from s to x to reserve the variables s to describe aggregate states in the HEXQ decomposition.
Table 5.1: The HEXQ algorithm

function HEXQ( MDP 〈states = X, actions = A〉 )
 1. X ← sequence of variables 〈X1, X2, . . . , Xd〉 sorted by frequency of change
 2. S1 ← X1
 3. A1 ← A
 4. for level e ← 1 to d − 1
 5.     explore Se, Ae transitions at random to find T^a_{ss'}, Exits(Se), Entries(Se)
 6.     MERe ← Regions(Se, Ae, T^a_{ss'}, Exits(Se), Entries(Se))
 7.     construct sub-MDPs from MERe using Exits(Se)
 8.     for all sub-MDPs m
 9.         E*_m(se, ae) ← ValueIteration(sub-MDP m, γ = 1)
10.     Ae+1 ← ∪_{i ∈ MERe} Exits(i)
11.     Se+1 ← MERe × Xe+1
12. Execute(level d, initial state, top level sub-MDP)
end HEXQ
The sub-task of intra-room navigation, using the position-in-room variable, does not
depend on the room-number variable value. In this sense rooms are regions invariant
of the specific room number. HEXQ tries to find sub-space regions, one variable at a
time, for which the state transition and reward functions are invariant in all contexts
defined by the values of the remaining variables. This invariance is exploited later,
as policies learnt over these regions are reused in different contexts. The individual
state variables are used to construct the different levels of the hierarchy. There are
as many levels in the hierarchy as there are variables in the overall state description2.
The algorithm starts by sorting the variables of the problem by frequency of
change (line 1), as frequently changing variables are likely to be associated with
tasks at lower levels. Since the room-number changes at less frequent time intervals
the algorithm explores the behaviour of the position-in-room variable first (lines 2
to 5). The state space that the algorithm considers at this stage is the total state space projected onto the position-in-room variable values.

2 Section 5.7 describes the conditions under which levels may be collapsed to optimise storage requirements.

Figure 5.1: The simple maze example introduced previously. The invariant sub-space regions are the rooms. The lower half shows four sub-MDPs, one for each possible room exit. The numbers are the position-in-room variable values.

The transition and reward
model for this projection is explored using primitive actions. The room region (the
MER in line 6) is discovered by HEXQ by finding state transitions and associated
rewards that are invariant of room number and linking them together to form a
contiguous state-action space. Some transitions are discovered not to be invariant
of room-number. For example, moving North from position-in-room 2 transitions to
position-in-room 22 when in room 2 but position-in-room 2 when in room 0. These
unpredictable transitions are flagged as exits. In this example exits correspond to
potential ways of leaving a room via doorways but they may have a more abstract
interpretation in other problems.
Having identified a typical room region, the only motivation an agent in a room
can have is to leave it. The reason is that all immediate rewards for transitions inside
the room are negative by definition of a stochastic shortest path problem and a policy
to stay inside a room indefinitely will not be optimal. The algorithm constructs
separate sub-MDPs (line 7) that learn the value function and a policy (lines 8 and 9)
to exit a room via every possible exit. These sub-MDPs are illustrated on the bottom
of figure 5.1. Note that by projecting the total state space onto the position-in-room
variable the original representation has allowed the three rooms to be represented
as one generic room region, implicitly performing the state abstraction to this one
Markov equivalent region (MER) as described in section 4.4.1.
The learnt policy to leave a room by one of the exits can be invoked anywhere
in a room and has the effect of moving the agent out of the room. This represents
a temporally extended or abstract action. From the viewpoint of an agent that can
only sense the room it is in, performing abstract room leaving actions is all that is
necessary for it to solve the maze problem. It can learn to choose the right policy
of abstract actions to navigate from room to room and reach its goal. This abstract
problem is a semi-MDP, its abstract actions are policies to exit each room (line 10)
and its abstract states are the individual rooms using the room-number variable
(line 11).
To decompose MDPs with more than two variables the algorithm will search for
abstract regions in the abstract problem, iteratively constructing the HEXQ graph
level by level using the for-loop in lines 4 to 11.
The top level will always be just one semi-MDP that solves the original problem.
It is invoked in line 12 of the algorithm. The top level semi-MDP invokes a cascade
of abstract actions along the path down the HEXQ graph. The actions at the lowest
level are the primitive actions. In the maze example the agent will first decide on a
room leaving action that in turn invokes a lower level sub-MDP controller to execute
a sequence of primitive actions that cause the agent to leave the room.
The steps of the HEXQ algorithm are now described in greater detail.
5.1 Variable ordering heuristics
The basis of HEXQ is to discover sub-tasks that can be learnt once and then reused
multiple times in different contexts. If repetitive sub-space regions can be found in
which the agent can learn to perform tasks without reference to their wider context,
then this invariance is useful in reducing the learning effort. Skills invariant to
context need only be learnt once and then transferred and used in other contexts.
These skills can also be used in contexts unseen by the learner, providing HEXQ
with the ability to generalise.
Repetitive tasks that are performed with the highest frequency tend to be the build-
ing blocks for more complex tasks. Pressing a button is a basic skill that can be
used to dial a telephone number, switch on lights or start a microwave oven. Dialing
a telephone number is a repetitive skill that in turn can be used to call a taxi, book
a flight, talk to friends, etc. To decompose a multi-dimensional state space it makes
sense to order the variables by frequency of change. In this way the variables that
change less frequently can provide a context for the more frequently changing ones.
The telephone exchange will not change state for the duration of the hand-eye co-
ordination task in pushing the button. The booking office phone will not ring until
all the numbers have been dialled. The flight will stay in an un-booked state until
the telephone call is complete.
In a similar manner subroutines in computer programs that execute most fre-
quently are associated with the lowest levels of execution. Variables that change
value more frequently should be associated with the lower levels of the hierarchy.
Conversely, variables that change less frequently set the context for the more fre-
quently changing ones. This is the intuition behind the heuristic to order variables3.
While it is not a requirement that variables must change values at different time
scales for the original MDP, it is a characteristic required to produce useful HEXQ
decompositions.

3 Future research that dispenses with this heuristic is proposed in section 9.3.
When searching for invariance, variables that remain constant for longer periods
of time are likely to set a more durative context for the faster changing variables.
Hence sub-space regions are formed first in HEXQ by variables that change more
frequently.
The first step is to order the variables by frequency of change. To order the
variables, the agent is allowed to explore its environment, at random, for a period
of time and statistics are kept on the frequency of change of each of the state
variables. The appropriate amount of exploration is user determined at this stage.
The variables are then sorted based on their frequency of change. The hierarchy
is constructed from the bottom up starting with the variable that changes most
frequently. Line 1 in the HEXQ algorithm in table 5.1 orders the variables.
For the simple maze example (figure 5.1), table 5.2 shows the frequency that
each variable changed value during a 2000 random action exploration run. The
position in room variable changes value more frequently than the room number
variable. The order in which the variables are therefore processed to build the
hierarchy is: X1 = position-in-room and then X2 = room-number.
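The ordering step can be sketched as follows (the env.reset/env.step interface returning a state tuple is an assumption for illustration, not the thesis code): the agent acts at random, counts how often each state variable changes value, and sorts the variables by that count:

    import random

    # A minimal sketch of the variable ordering heuristic (line 1 of the HEXQ
    # algorithm): count how often each state variable changes value during random
    # exploration and return the variable indices, most frequently changing first.

    def order_variables(env, actions, steps=2000, seed=0):
        rng = random.Random(seed)
        state = env.reset()
        counts = [0] * len(state)
        for _ in range(steps):
            next_state = env.step(rng.choice(actions))
            for i, (old, new) in enumerate(zip(state, next_state)):
                if old != new:
                    counts[i] += 1
            state = next_state
        return sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)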
HEXQ numbers the levels in the hierarchy from the bottom starting at level 1.
Lines 2 and 3 of the algorithm initialise the variable at level 1 to be the variable
that changes most frequently and the actions to be the primitive actions. By only
considering the values of one variable, all the states of the original MDP are ef-
fectively projected onto this variable. These projected states are highly aliased in that the values of all the other variables are not specifically referenced. For the maze in figure 5.1, S1 = {0, 1, 2, . . . , 24} and A1 = {North, South, East, West}.
Table 5.2: Frequency of change for the rooms example variables over 2000 random steps.

Variable            Frequency   Order
Room number              21         2
Position in Room       1631         1
5.2 Finding Markov Equivalent Regions
The aim is to find connected regions for which the transition and reward functions
for the projected state space are invariant in the context of all the less frequently
changing variables.
In line 5 the HEXQ algorithm explores the state and reward transitions for the
projected variable S1 = X1, initially for level one, but in general for level e, by taking
random actions. The amount of exploration is provided by the user and specified
as the number of times all projected state action pairs are executed. The amount
of exploration needs to be adequate to allow HEXQ to find all exits. The amount
of exploration is currently found by trial and error (but see section 9.1 on how this
may be developed in the future). Statistics on all projected state transitions and
rewards are recorded.
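A minimal sketch of the bookkeeping in this exploration step (the dictionary layout is an illustrative assumption): transition counts and summed rewards for the projected variable are recorded so that transition probabilities and expected rewards can be estimated later:

    from collections import defaultdict

    # A minimal sketch of the statistics gathered in line 5 of the HEXQ algorithm
    # for the projected variable: frequency counts for (s, a, s') transitions and
    # summed rewards, from which T and R estimates are formed.

    counts = defaultdict(int)     # counts[(s, a, s_next)]
    totals = defaultdict(int)     # totals[(s, a)]
    rewards = defaultdict(float)  # rewards[(s, a, s_next)], summed

    def record(s, a, r, s_next):
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
        rewards[(s, a, s_next)] += r

    def estimate(s, a, s_next):
        # Estimated transition probability and expected reward for (s, a, s_next).
        n = counts[(s, a, s_next)]
        t = totals[(s, a)]
        p = n / t if t else 0.0
        r = rewards[(s, a, s_next)] / n if n else 0.0
        return p, r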
Line 6 partitions the projected state space in accordance with the HEXQ parti-
tion definition 4.2. For the maze, the projected state space and transitions are shown
in figure 5.2. Four of the transitions are exits because they have unpredictable out-
comes from the point of view of the position-in-room variable. The transitions are
unpredictable for multiple reasons: the problem may terminate, the room-number
may change or the transition is not stationary.
To see how the partitioning process relates to Chapter 4, the global states are
defined as (se, y) where se is the projected state and y = xe+1×xe+2 . . .×xd. Initially
e = 1 and se = x1.
Figure 5.2: The Markov Equivalent region for the maze example showing the four possible exits.
To find the Markov equivalent regions (MERs), an action labelled directed graph
is drawn in which the nodes are the state values of variable se only and the edges are
the non-exit transitions that have some probability of occurring. Edges associated
with exits are not represented in the graph and strongly connected components in
the resultant graph are used to form the MERs.
The one MER for the maze example in figure 5.1 (a) is shown in figure 5.2 as a
directed graph. The exits indicate transitions for which the room-number variable
may change value. They are:
(s1 = 2, a1 = move-north)
(s1 = 10, a1 = move-west)
(s1 = 14, a1 = move-east)
(s1 = 22, a1 = move-south)
The states of the graph are strongly connected. The agent can be initialized
at any position-in-room value, hence all states are entry states. It is easy to verify
visually that all entry states can reach all exit states in this region.
Exits must be discovered to allow MERs to be formed. The algorithmic details
of how exits are discovered in general will be discussed next.
5.2.1 Discovering Exits
The definition of an exit is developed in section 4.1. Viewed from the perspective
of the projected state variable se, a state action pair is an exit if:
1. The MDP terminates.
2. The context changes (i.e. any higher level variable changes value).
3. The transition is between different Markov equivalent regions.
4. The transition function is not stationary.
5. The reward function is not stationary.
6. if (se, a) is an exit then it is an exit for all y.
Termination or Context Change
Exits are easily identified when an MDP terminates or the context (y variable value) changes. In any transition from $s^e_i$ to $s^e_j$ following action $a$, if any of the variables $x_{e+1}, \ldots, x_d$ change value or the MDP terminates, then $(s^e, a)$ is an exit.
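A minimal sketch of this check (argument names are illustrative): after each observed transition the higher level variable values are compared, and the projected state-action pair is flagged as an exit if any of them changed or the MDP terminated:

    # A minimal sketch of exit detection by termination or context change: a
    # projected (state, action) pair becomes an exit if the MDP terminates or any
    # higher level variable changes value on the observed transition.

    exits = set()

    def check_context_exit(s_proj, action, higher_before, higher_after, terminated):
        # higher_before / higher_after are the tuples of higher level variable
        # values (x_{e+1}, ..., x_d) observed before and after the transition.
        if terminated or higher_before != higher_after:
            exits.add((s_proj, action))

    if __name__ == "__main__":
        # Moving north from position-in-room 2 changes the room number: an exit.
        check_context_exit(2, "North", higher_before=(0,), higher_after=(2,),
                           terminated=False)
        print(exits)   # {(2, 'North')}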
Non-stationary Transitions
If transitions or reward functions vary for different values of y then they are variant
and associated exits are declared. By focusing on variable se the state space of
the total MDP has effectively been projected onto this variable. Transitions in this
projection are highly aliased in that they represent transitions for any value of the
y variable. It is possible that the probability of transition or the expected reward is
conditionally dependent on the y variable values. If this happens then the projected
transition will not be stationary (invariant of the y value) and will give rise to an
exit. Exits can be discovered by testing the transition and reward probabilities in
the context of different y variable values.
To illustrate how this may arise, figure 5.3 shows the four room example in which
there is a corner wall obstacle placed in room 0 and the reward on transition from
state (2, 0) to state (3, 0) is −10 instead of the usual −1. The probability of some
transitions are now no longer stationary from the perspective of the position-in-room
variable. A procedure is required for finding these non-stationary transitions.
For deterministic transitions it is easy to determine the non-stationary transitions
for projected states. It is only necessary to find a transition to two different next
states or with two different reward values to trigger an exit condition. In the example
in figure 5.3 the transition from state (2, 0) to state (3, 0) in room 0 has a reward
of −10 in contrast to the reward received for transitions from (2, 1) to (3, 1) or
(2, 2) to (3, 2) in rooms 1 and 2 respectively which have an associated reward of −1.
Therefore the transition from state s1 = 2 to s1 = 3 is not deterministic with respect to reward and (s1 = 2, action = East) becomes an exit. In the case of a transition from s1 = 7 to s1 = 8 the transition fails in room 0 but not in the other rooms. Again this transition is not deterministic. The pair (s1 = 7, action = East) is recorded as an exit.

Figure 5.3: The maze example with a corner obstacle and an expensive transition in room 0 giving rise to non-stationary transitions from the perspective of the location-in-room variable.
In the stochastic case testing for non-stationary transitions or rewards is more
involved. It is necessary to keep transition statistics and then test the hypothesis
that the reward and state transitions come from different probability distributions
for different values of the y variable. In theory it is possible to determine this to any
degree of accuracy given that it is possible to control the sample size and test all the
individual contexts represented by the y variable value. In practice this can become
intractable because the combinations of higher level variables and hence contexts
can grow exponentially.
Instead, a heuristic is employed to find non-stationary state transitions. The
transition statistics are recorded over a shorter period of time and compared to
their long term average. The objective is to explicitly test whether the probability
distribution is stationary. From the total sample space of each transition from state $s^e_i$ following action $a$, the probability $p$ of reaching state $s^e_j$ is calculated as $p = \text{frequency}(s^e_i, a, s^e_j) / \text{frequency}(s^e_i, a)$. The outcome of a sample of temporally close transitions of the same type, $(s^e_i, a, s^e_j)$, is recorded. Using a binomial distribution4 based on the average probability $p$, the likelihood that this temporally close sample has a different probability can be tested (see section A.3). When this is the case, $(s^e_i, a)$ is declared an exit. This test is performed multiple times in line 5 of the HEXQ algorithm for each type of transition as new temporally close samples become available. The procedure is to count the number $n$ of times $s^e_i$ transitions to $s^e_j$ on action $a$ in the temporal sample of $N$. If $n < pN$ then the level of significance $\alpha$ is

$$\alpha = \sum_{i=0}^{n} \frac{N!}{i!(N-i)!} \, p^i (1-p)^{N-i}. \qquad (5.1)$$
If n ≥ pN then the sum is taken from n to N . The significance level was set at
0.0001% with N = 17 to trigger exits. In practice this test may falsely identify some transitions as exits and miss others. The penalty for
recognizing extra exits is simply to generate some additional overhead for HEXQ
as these exits may require new policies to be learnt and will require additional
exploration for the extra abstract actions in the next level up the hierarchy. This
does not detract from the quality of the solution in terms of optimality.
If important exits are missed, the solution may be of poorer quality or in the
worst case, fail altogether. Chapter 9 makes a number of suggestions to improve
exit discovery. At this stage the algorithm relies on an adequate exploration period
for this statistical test to find all relevant exits.
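A minimal sketch of the significance test of equation 5.1 (the significance level and sample size are the ones quoted in the text; the function names are illustrative, not the thesis code):

    from math import comb

    # A minimal sketch of the binomial test used to flag non-stationary
    # transitions: the count n in a temporally close sample of size N is compared
    # against the long-run probability p, and an exit is declared if the tail
    # probability falls below the significance level.

    def binomial_tail(n, N, p, lower_tail):
        # P(X <= n) if lower_tail else P(X >= n) for X ~ Binomial(N, p).
        ks = range(0, n + 1) if lower_tail else range(n, N + 1)
        return sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in ks)

    def is_non_stationary(n, N, p, alpha=1e-6):
        # Significance level of 0.0001% (= 1e-6); N = 17 was used in the text.
        return binomial_tail(n, N, p, lower_tail=(n < p * N)) < alpha

    if __name__ == "__main__":
        # Long-run probability 0.9, but only 2 successes in a close sample of 17.
        print(is_non_stationary(2, 17, 0.9))   # True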
Another test is required to detect reward function non-stationarity. Unlike states,
rewards are continuous scalar values. To test the hypothesis that the rewards for
a transition from $s^e_i$ to $s^e_j$, following action $a$, are from the same distribution, the Kolmogorov-Smirnov (K-S) test (see A.3) is applied to two consecutive temporal samples of rewards. If the test indicates that the transition rewards in HEXQ come from different distributions, the pair $(s^e_i, a)$ is declared an exit.

4 The Chi-squared test has been used in some versions of HEXQ with similar intent.
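A minimal sketch of this reward test using SciPy's two-sample Kolmogorov-Smirnov test (the significance threshold below is an assumed illustrative value, not one quoted in the text):

    from scipy.stats import ks_2samp

    # A minimal sketch of the reward non-stationarity check: two consecutive
    # temporal samples of rewards for the same (s, a, s') transition are compared
    # with a two-sample K-S test; a small p-value suggests the rewards come from
    # different distributions, so the transition is flagged as an exit.

    def rewards_non_stationary(sample_1, sample_2, alpha=1e-4):
        result = ks_2samp(sample_1, sample_2)
        return result.pvalue < alpha

    if __name__ == "__main__":
        usual = [-1.0] * 20      # rewards for this transition seen in rooms 1 and 2
        in_room0 = [-10.0] * 20  # rewards for the same transition seen in room 0
        print(rewards_non_stationary(usual, in_room0))   # True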
5.2.2 Forming Regions
A HEXQ partition requires that in every Markov equivalent region (MER) an agent
must be able to reach any exit state from any entry state, with probability one,
without executing exits. To ensure that this is possible, the HEXQ algorithm in line
6 first finds the strongly connected components (SCCs) of the underlying projected
state transition graph with the nodes connected only by non-exit transitions. SCCs
have the property that all nodes can be reached with probability one. SCCs are
combined into MERs if all resultant region entry states can reach all region exit
states. The pseudo code for finding MERs is provided in table 5.3 and uses the
algorithm to find strongly connected components from section A.2.
The linear time algorithm to find SCCs requires an adjacency matrix specifying
the directed edges between nodes (table 5.3 lines 1-3).
Definition 5.1 State s′ is adjacent to state s, written adj[s][s′] and an edge is
drawn from node s to node s′ if there exists an action a such that the probability of
transition from s to s′ following action a is greater than zero and the transition is
not an exit, i.e.
∃a T ass′ > 0 and (s,a) is not an exit =⇒ adj[s][s′] (5.2)
Under some stochastic conditions it is not possible to ensure that any exit can
be reached with probability one in a multi-state region such as a room. The random
nature of the transitions may make it possible that an agent will drift through an
unintended exit. To address this situation, the function Regions keeps partitioning
the MERs, right down to single state regions if necessary, to ensure the reachability
Table 5.3: Function Regions finds all the Markov Equivalent Regions (MERs) at level e given a directed graph for a state space Se, and the Exits(Se) and Entries(Se) sets.

function Regions( states Se, actions Ae, T^a_{ss'}, Exits(Se), Entries(Se) )
    // Find SCCs so that all states are reachable from all others
 1. repeat until the number of SCCs does not increase
 2.     for each s, s' ∈ Se and a ∈ Ae (where s transitions to s' on action a)
 3.         if (T^a_{ss'} > 0 and (s, a) ∉ Exits(Se)) adj[s][s'] ← true
 4.         else adj[s][s'] ← false
 5.     SCC( states Se, adj[s][s'] )
 6.     for each s, a, s' connecting two SCCs
 7.         adj[s][s'] ← false
 8.         Exits(Se) ← Exits(Se) ∪ {(s, a)}
 9.         Entries(Se) ← Entries(Se) ∪ {s'}
10. end repeat
    // Join SCCs into regions so that all region entry states can reach all exit states
11. for each s, a, s' connecting two SCCs
12.     if (SCC[s] has no other exit or SCC[s'] has no other entry)
13.         Exits(Se) ← Exits(Se) − {(s, a)}
14.         Entries(Se) ← Entries(Se) − {s'}
15.         for each si if (SCC[si] = SCC[s'])
16.             SCC[si] ← SCC[s]
17. return |SCCs|, SCC[·], Exits(Se), Entries(Se)
end Regions
Figure 5.4: All actions in this example are assumed to have some probability of transitioning to adjacent states. Part (a) illustrates two such actions near doorways. Function Regions breaks a room iteratively into single state MERs. The results of the first three iterations are shown as parts (b), (c) and (d).
condition of each region is met. Figure 5.4 illustrates the effect of function Regions
when the room navigation actions have some probability of transitioning to any
adjacent state on any action. Under these circumstances function Regions breaks
the room into individual state MERs which passes the exit issue to higher levels in
the hierarchy to resolve, effectively reverting to the solution of the original MDP.
This type of partitioning is achieved by repeatedly calling lines 1-10 in function
Regions until no additional SCCs are generated, ensuring that all exits are proper
in a stochastic setting. The reasoning is as follows. After each function call to SCC,
the strongly connected components of any directed graph form a directed acyclic
graph (DAG) in which the nodes themselves are the components. The state and
action associated with each edge leaving a component becomes an additional exit
and the node associated with the entering edge becomes an additional entry. This
ensures that inter-block transitions give rise to exits, in other words, that transitions
between different MERs are exits. The edges associated with these new exits are
removed from the adjacency matrix of the underlying directed graph. Removing
links in the adjacency matrix may result in additional SCCs. The interdependence
of the exits and HEXQ partitions means that the algorithm to find SCCs must be
rerun every time new exits are introduced as the adjacency matrix is changed.
The process stops when each state in a SCC can be reached from any other state
with a probability greater than zero without being forced to exit. When a MER
consists of only one state, this state is both an entry and an exit state and the
condition is met trivially. A policy can therefore be created that can reach any state
with probability one without leaving the SCC. With deterministic actions, only one
pass is required to find SCCs as the removal of edges between SCCs will not remove
any edges within SCCs (as may be the case for stochastic actions).
The SCCs found in lines 1-10 of function Regions could be equated to MERs
that are later abstracted to form higher level states. For an efficient decomposition
the aim is to maximize the size of the MERs in order to minimize the number of
abstract states. SCCs over-specify the partition requirements. The only requirement
is that all exit states be reachable from all entry states, not that all states can reach
each other.
It may be possible to combine some SCCs to form larger MERs. The SCCs
still form a DAG when the edges removed in lines 6 to 9 of function Regions are
reinstated. Two connected SCCs with edges say from scci to sccj can be joined if
scci has no other exits or sccj has no other entering edges. In lines 11-16 of function
Regions, table 5.3, SCCs are tested for combination opportunities with the proviso
that they do not violate exit reachability requirements. The Regions algorithm
returns the MERs as MERe = SCC[·] in line 17. A MER, therefore, is a single
SCC or a combination of SCCs such that any exit state in the region can be reached
from any entry with probability 1. When combining SCCs to form regions a mixture
of MERs and SCCs can be combined in the same way.
Figure 5.5 illustrates the combination process. It shows four SCCs labelled 0 to
3 connected as a DAG with external entry and exit edges. SCC scc0 and scc1 can
be joined to form MER0 because scc1 has no other entries other than the one from
scc0.

Figure 5.5: Four SCCs joined to form two MERs.

SCC scc2 can be joined with scc3 to form MER1 because scc2 has no other
exit edges besides the one to scc3. SCC scc0 cannot be joined to scc2, for example,
because scc0 has another exit edge (to scc1) and scc2 has another entry edge from
outside. MER0 cannot be combined with MER1 for similar reasons.
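A minimal sketch of the SCC step (using the networkx library for the strongly connected components; the data layout is an illustrative assumption, and the iterative re-partitioning and joining of SCCs performed by function Regions is not repeated here):

    import networkx as nx

    # A minimal sketch of the first stage of function Regions: build the directed
    # graph whose edges are non-exit transitions with non-zero probability
    # (definition 5.1) and find its strongly connected components; each SCC is a
    # candidate building block for a MER.

    def candidate_regions(states, transitions, exits):
        # transitions: iterable of (s, a, s_next) tuples with positive probability;
        # exits: set of (s, a) pairs already flagged as exits.
        g = nx.DiGraph()
        g.add_nodes_from(states)
        for s, a, s_next in transitions:
            if (s, a) not in exits:
                g.add_edge(s, s_next)
        return sorted(sorted(c) for c in nx.strongly_connected_components(g))

    if __name__ == "__main__":
        # A tiny 4-state example in which one exit action cuts the space in two.
        states = [0, 1, 2, 3]
        transitions = [(0, "a", 1), (1, "a", 0), (1, "b", 2), (2, "a", 3), (3, "a", 2)]
        exits = {(1, "b")}
        print(candidate_regions(states, transitions, exits))   # [[0, 1], [2, 3]]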
The repeated call on the SCC function means that in the worst case finding
MERs now takes $O((|S^e| + |A^e|)^2)$ time. This in itself is not an issue, as the states involved are only those abstracted at level e, unless the number of states grows exponentially because the overall problem will not decompose efficiently. The
MERs are arbitrarily labelled for convenience with consecutive integers starting from
0. This completes line 6 in function HEXQ in table 5.1.
5.3 Creating and Solving Region Sub-MDPs
Now that MERs have been found that are invariant in the context of all other
combination of higher level variable values, the HEXQ algorithm proceeds to find a
set of policies for these regions (lines 7 to 9). The only motivation an agent can have
in a region is to exit, as all non-exit transitions have a negative reward. One policy
is found to reach each exit in the MER. A policy to reach an exit state followed by an exit action comprises an exit from the MER. The policies and their exit actions
are the abstract actions that are available at the next higher level in the hierarchy.
The rest of this section will explain how the cache of exit policies for each MER is
found by creating and solving sub-MDPs for each of the exits in each of the MERs.
One sub-MDP could be defined for each exit as described in section 4.2.2. How-
ever, if two or more exits have the same set of hierarchically nested exit states,
then it is not necessary to create multiple sub-MDPs because once they are reached,
the different exits can be executed simply by executing their different primitive exit
actions. It may not be sufficient to just have one sub-MDP for each abstract exit
state. The issue only arises when there are more than two levels in the hierarchy.
Consider, for example, the nested set of abstract states in figure 5.6. The shaded
region in the diagram represents a MER at level 2. It contains 4 abstract level 2
states, S^2_0, S^2_1, S^2_2 and S^2_3. Abstract state S^2_1 is an aggregate of level 1 states S^1_0, S^1_1 and S^1_2. There are two exits from the level 2 MER, both from exit state S^2_1. The value function to exit the MER will be different for each of these two exits because one of the exits requires an additional internal transition. Therefore two sub-MDPs are required because the set of states S^1_0 and S^2_1 differs from the set S^1_2 and S^2_1 in the hierarchy.
Definition 5.2 (Hierarchical State) For an MDP with state x ∈ X, the hierar-
chical state s for x at level e is the ordered sequence (s1, s2, . . . , se), where si ∈ Si
from step 11 in the HEXQ algorithm in table 5.1. Note the level e to which hierar-
chical state s is defined is always implied by the context in which it is used.
Abstract state si can be determined from x as will be seen later by equation
5.5. A hierarchical exit state at any level is the hierarchical state associated with
an exit. It is the sequence of states in an exit. The hierarchical exit state for the
exit (se, (se−1, (. . . (s1, a1) . . .))) at level e is (s1, s2, . . . , se). When hierarchical exit
states are the same for different exits, HEXQ only needs to construct one sub-MDP
for these exits.
Figure 5.6: Two exits of a level 2 MER that require 2 separate level 2 sub-MDPs even though both exits have the same level 2 exit state, S^2_1.
The construction of sub-MDPs (line 7 of the HEXQ algorithm) proceeds as out-
lined in section 4.2.2. One sub-MDP is created for each hierarchical exit state of each
MER. Recall that exits from other hierarchical exit states are not allowed by con-
struction. This ensures that the policy cache contains policies that are guaranteed
to exit from only one exit state.
HEXQ has already gathered considerable statistics, modelling the state transi-
tion and reward functions in the previous partitioning steps. It is therefore efficient
to solve the sub-MDPs using dynamic programming. A dynamic programming al-
gorithm for solving the sub-MDPs is given in table 2.1 (see footnote 5).
The hierarchically optimal policy for these sub-MDPs can also be learnt or im-
proved by using temporal difference methods such as Q learning concurrently with
exploration at the next level. It is important to remember to restrict actions for
other exit states to non-exit actions during learning, even for exploration, as un-
intended exits at any level will mislead the learner. The HEXQ implementation continues to update the E functions at all levels using Q-learning as shown later in section 5.6. The dynamic programming solution is based on the sample transitions that are explored and may only have provided an approximation to the true transition probability distributions. The benefit of continuous updating is to refine the value function and policy.

5. The standard process is only complicated by the need to recursively calculate the value of the post-transition states s′ from the decomposed value function as per equation 4.8 and described algorithmically in subsection 5.4. This means that the value function in step 11 of table 2.1 becomes V (s′) ← value(e, sub-MDP m, s′). The rewards R^a_ss′ in table 2.1 are interpreted as the primitive rewards on exit and the returned function Q∗(s, a) is E∗m(se, ae).
In the simple maze example there is one MER and four exit states as illustrated
in figure 5.2. Consequently four sub-MDPs are created, one for reaching each of the
exit states. These sub-MDPs are as illustrated in the bottom of figure 5.1. This
completes lines 8 & 9 of the HEXQ algorithm in table 5.1.
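To make the construction concrete, the fragment below sketches value iteration for a single region sub-MDP in Python. The dictionaries P and R stand for the transition and reward statistics gathered during exploration, and the handling of the decomposed value of post-transition states (footnote 5) is omitted; the function is an illustration under these assumptions rather than the thesis code.

def solve_sub_mdp(states, actions, P, R, exit_state, exit_action, sweeps=1000, tol=1e-6):
    # Value iteration (gamma = 1) for a stochastic shortest path sub-MDP whose
    # goal is to leave the region through (exit_state, exit_action).
    #   P[(s, a)] : dict mapping next states to estimated probabilities
    #   R[(s, a)] : estimated expected primitive reward for taking a in s
    # Actions that would exit through a different exit state are assumed to
    # have been removed from the action set beforehand.
    # Returns the action-value table E[(s, a)].
    E = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        delta = 0.0
        for s in states:
            for a in actions:
                if s == exit_state and a == exit_action:
                    target = R[(s, a)]          # taking the exit ends this sub-task
                else:
                    target = R[(s, a)] + sum(
                        p * max(E[(s2, a2)] for a2 in actions)
                        for s2, p in P[(s, a)].items())
                delta = max(delta, abs(target - E[(s, a)]))
                E[(s, a)] = target
        if delta < tol:
            break
    return E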
5.4 Hierarchical State Value
HEXQ requires the optimal value of a hierarchical state to be reconstructed from
the decomposition equations 4.9 and 4.10. The pseudo code for this procedure for
hierarchical state s at level e is given in table 5.4 and is similar to EvaluateMaxNode
(Dietterich, 2000).
The hierarchical state value is required by HEXQ in line 9 in table 5.1 and
to update the HEXQ function E during temporal difference learning. When the
hierarchical state is the current state of the agent, the next optimal (greedy) action
to take at all levels in the hierarchy is calculated as a byproduct of this procedure.
Function value implements a depth-first-search as in MAXQ. A depth-first-search
can become expensive (Dietterich, 2000), as the branching on abstract actions means that the number of paths to search grows exponentially with the number of levels.
Chapter 8 addresses ways to contain the search.
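A minimal Python rendering of this recursion, mirroring function value in table 5.4 below, is sketched here; the lookups actions_in, sub_mdp_of and the decomposed E tables are assumed to be supplied by the surrounding HEXQ implementation and are hypothetical names.

def value(e, m, s, E, actions_in, sub_mdp_of):
    # Return (optimal value of hierarchical state s in sub-MDP m at level e,
    # best greedy action at level e), following table 5.4.
    #   s                : hierarchical state, a tuple (s1, s2, ..., se)
    #   E[m][(se, a)]    : decomposed action-value table for sub-MDP m
    #   actions_in(m)    : (abstract) actions available in sub-MDP m
    #   sub_mdp_of(a, l) : sub-MDP implementing abstract action a at level l
    if e == 0:
        return 0.0, None
    best_v, best_a = float("-inf"), None
    se = s[e - 1]                                # projected state at level e
    for a in actions_in(m):
        if e == 1:
            v_below = 0.0                        # primitive actions have no inner value
        else:
            m_below = sub_mdp_of(a, e - 1)
            v_below, _ = value(e - 1, m_below, s, E, actions_in, sub_mdp_of)
        v = v_below + E[m][(se, a)]
        if v > best_v:
            best_v, best_a = v, a
    return best_v, best_a

The exponential cost mentioned above shows up here as the loop over abstract actions repeated at every level of the recursion.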
5.5 State and Action Abstraction
Once a policy cache has been established for each MER it is possible to reformulate
and reduce the original MDP by eliminating a variable. In the maze example, once
Table 5.4: Procedure for evaluating the optimal value of a hierarchical state in a HEXQ graph. The function returns the optimal value of the hierarchical state s based on the optimal policy for each sub-MDP and finds the best greedy action at every level up to e
function value( level e; sub-MDP m; hierarchical state s )
1. if e = 0 return 0
2. A ← actions available in m
3. v ← −∞
4. ae∗ ← undefined (best greedy action at level e)
5. for each a ∈ A
6. me−1 ← sub-MDP associated with a at level e− 1
7. v′ ← value(e− 1, me−1, s)+E∗m(se, a)
8. if (v′ > v)
9. v ← v′
10. ae∗ ← a
11. return v, ae∗
end value
room leaving policies are established, the position-in-room variable can be eliminated
and a reduced semi-MDP defined using the room-number as the state and the room
leaving policies as actions, albeit temporally extended. This was shown in figure
4.12 in Chapter 4.
When there is more than one MER at the next level, projected states may consist
of all possible combinations of MER types and values of the next variable in the
frequency order. Lines 10 and 11 of the HEXQ algorithm prepare the abstract states
and actions for the semi-MDP at the next level.
Abstract actions at the next higher level are policies for exiting regions at the
current level. They are the optimal policies to exit the sub-MDPs. The notation
that is used to describe an abstract action at level e + 1 is (se, ae) where se is an
exit state and ae is an (abstract) exit action at level e. This is the same notation as
that used to describe exits at level e. There is a 1 : 1 correspondence between all
the exits at one level and the abstract actions at the level above. Given the set of
MERs, MERe, at level e, the set of all abstract actions Ae+1 at level e + 1 is given
by
Ae+1 = ∪i∈MERe Exits(i).    (5.3)
Taking abstract action (se, ae) at level e + 1 means that the agent is required
to use the optimal policy for the right sub-MDP at level e to move to exit state se
and then execute action ae. In the maze example the set of four abstract actions is
A2 = { (s1 = 2, a = move-north), (s1 = 10, a = move-west), (s1 = 14, a = move-
east), (s1 = 22, a = move-south) }. The taxi example of section 6.1.2 illustrates a
case where the number of abstract actions is greater than the number of sub-MDPs
required.
Abstract projected states at the next level Se+1 are formed by the cartesian
product of the set of MER labels for the current level e and the set of values for the
next state variable in the ordering Xe+1.
Se+1 = MERe ×Xe+1 (5.4)
For convenience, to easily index arrays in the code, the numerical method used to determine an abstract variable value se+1 for state (se, xe+1 × . . . × xd) is

se+1 = |MERe| · xe+1 + me    (5.5)

where |MERe| is the number of MERs at level e and me is the MER label for level e (see section 5.2.2) such that se ∈ MER.
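As a small worked illustration of equation 5.5, the abstract value can be computed with one line of integer arithmetic; the helper below is hypothetical, not part of the thesis code.

def abstract_state(num_mers, mer_label, x_next):
    # Equation 5.5: index of the abstract state at level e+1 formed from the
    # MER containing the current level-e state (mer_label) and the value of
    # the next variable in the frequency ordering (x_next).
    return num_mers * x_next + mer_label

# Taxi example at level 2: there is a single MER labelled 0 at level 1, so the
# abstract state index simply equals the passenger location value.
assert abstract_state(num_mers=1, mer_label=0, x_next=3) == 3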
It is generally the case that different abstract states have different abstract ac-
tions. The semi-MDP that is defined for the next level has one less variable than
the previous one and uses only abstract actions. The procedure (lines 5 to 11 in
function HEXQ table 5.1) is repeated, finding MERs and exits using the abstract
states and actions at the next level. In this way, one level of hierarchy is generated
for each variable in the original MDP. When the last state variable is reached, the
top level sub-MDP, represented by the final abstract states and actions, solves the
overall MDP.
5.6 HEXQ Execution
The algorithm for executing HEXQ at any level, including the top level, is given
in table 5.5. It executes the HEXQ hierarchy, recursively calling itself as it invokes
lower level policies. At each level it remembers the abstract state. On return,
following the execution of an abstract or primitive action, it updates the E value
using Q-Learning as per table 2.2.
Hence, the final step in the HEXQ algorithm is to Execute(top level, starting
state, 0). There is only one sub-MDP at the top level that is invoked with abstract
action 0. When function Execute returns, the problem has terminated and can be
restarted by re-initialising the problem according to the starting state distribution
X0 and calling Execute again6.
5.7 Efficiency Improvements
There are a number of ways that HEXQ can be improved. The following two ideas
have not been implemented at this time.
6. Some refinements have not been included in the pseudo code. The exit for the top level sub-MDP may have reward on termination, depending on the problem. A condition to stop the learning process could be included.
Table 5.5: Function Execute solves sub-MDP m associated with abstract action a at level e. The state s is the current hierarchical state at each level depending on the context in which it is used. lse is the last projected state at level e. The learning rate β is set to a constant value. The E tables are originally initialised to 0.
function Execute(level e, state s, action a)
1. if (e = 0)
2. execute primitive action a and observe s and reward
3. return
4. m ← sub-MDP associated with action a
5. repeat
6. act ← value( level e, sub-MDP m, state s) or exploration policy
7. lse ← abstraction of s at level e
8. Execute(level e− 1, state s, action act)
9. if((s, act)=exit) return
10. else
11. V ′ ← value( level e, sub-MDP m, state s)
12. Em(lse, act) = (1− β)Em(lse, act) + β(reward + V ′)
end Execute
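The control flow of function Execute translates directly into Python. In the sketch below, env, value, sub_mdp_of and is_exit are assumed helpers standing in for the surrounding HEXQ machinery (value is the hierarchical valuation of table 5.4, returning a value and a greedy action); only the recursion and the one-step E update are intended to follow table 5.5.

BETA = 0.25   # constant learning rate (0.25 in the taxi experiments)

def execute(e, s, a, env, E, value, sub_mdp_of, is_exit):
    # Recursively execute (abstract) action a at level e from hierarchical
    # state s.  Returns the resulting hierarchical state and the last
    # primitive reward observed.
    if e == 0:
        return env.step(a)                       # primitive action: (next state, reward)
    m = sub_mdp_of(a, e)                         # sub-MDP implementing abstract action a
    while True:
        _, act = value(e, m, s)                  # greedy action (or an exploration policy)
        ls_e = s[e - 1]                          # remember the projected state at level e
        s, reward = execute(e - 1, s, act, env, E, value, sub_mdp_of, is_exit)
        if is_exit(e, ls_e, act):                # this region has been exited
            return s, reward
        v_next, _ = value(e, m, s)               # one-step Q-learning style update of E
        E[m][(ls_e, act)] = (1 - BETA) * E[m][(ls_e, act)] + BETA * (reward + v_next)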
5.7.1 No-effect Actions
If at level e an action a always leaves the hierarchical state s unchanged then no
HEXQ value function entries E(se, a) for this state-action pair need be stored. For stochastic shortest path problems, since the rewards are always negative, a hierarchi-
cal state transition to itself with probability one will never be a part of an optimal
policy. This action can therefore be safely eliminated from being available in such
a state. The savings in storage and learning time can be significant, as HEXQ will
otherwise explore and store values for all allowable actions.
If the action causes an exit in some states then exits are still noted, but the value
function E is by definition zero and does not need to be stored for any exit actions,
except at the top level if there is a final reward on exit.
5.7.2 Combining Levels
The HEXQ algorithm described in this chapter builds one level of hierarchy for each
variable in the state description. It may be the case that a sub-set of variables are so
interdependent that the MERs are too small to provide an effective decomposition.
When the MERs are individual states no state abstraction is possible and if all the
primitive actions remain as exits then there is no advantage in introducing another
level in the hierarchy. If this is the case, it is more efficient to combine the variables
by taking their Cartesian product directly and skipping a level in the hierarchy.
The decision to combine variables directly can be algorithmically determined if, for
example, the storage space for the action-value function E is the criterion.
Assume the states at level e are se and the next variable to be processed is xe+1.
Say the ith region has |sei| states, the number of actions per state is |aei|, the number of exits is |eei| and the number of sub-MDPs required is |mei|. The storage requirement with separate levels in the HEXQ hierarchy is:

StwoLevels = Σi (|sei| · |aei| · |mei|) + |xe+1| Σi |eei|    (5.6)

If the variables se and xe+1 are combined, then the storage requirement is:

ScombinedLevel = |xe+1| · Σi (|sei| · |aei|)    (5.7)
The decision to combine variables would be triggered when ScombinedLevel <
StwoLevels. These calculations do not include no-effect actions. An examination
of the equations shows why HEXQ has the potential to improve the space complex-
ity. In StwoLevels the numbers of values of the two state variables appear in separate terms that are added. For the combined storage the sizes of the two state spaces are multiplied.
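The criterion can be evaluated mechanically. The sketch below assumes the per-region counts are available as parallel lists and simply compares equations 5.6 and 5.7; it is illustrative only.

def should_combine(region_states, region_actions, region_sub_mdps, region_exits, x_next_size):
    # Compare the E-table storage of keeping separate levels (equation 5.6)
    # with combining the level-e variable and the next variable (equation 5.7).
    # The i-th entry of each list describes region i at level e.
    s_two_levels = (
        sum(s * a * m for s, a, m in zip(region_states, region_actions, region_sub_mdps))
        + x_next_size * sum(region_exits))
    s_combined = x_next_size * sum(s * a for s, a in zip(region_states, region_actions))
    return s_combined < s_two_levels, s_two_levels, s_combined

Section 6.1.3 gives a taxi example where this test favours combining two of the variables (ScombinedLevel = 150 against StwoLevels = 220).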
5.8 Conclusion
This chapter explained the operation of the HEXQ algorithm in detail. HEXQ
automatically decomposes and solves a stochastic shortest path multi-dimensional
MDP. It does this by eliminating one variable at a time, searching for invariant
Markov equivalent regions and collapsing the regions into abstract states. Each
region is equipped with its own cache of intra-region policies. The policies are
abstracted to form abstract actions. The abstract states and actions form a new
semi-MDP with one less variable. Repeated abstractions hierarchically abstract the
original MDP. The next chapter will test HEXQ in a variety of different domains.
Chapter 6
Empirical Evaluation
This chapter will show how HEXQ performs in a number of different domains. The
first one is the taxi task. This example was created by Dietterich (2000) to illustrate
MAXQ. For MAXQ the structure of the hierarchy is specified by the user. HEXQ
decomposes and solves this problem automatically. The taxi task is then varied by
• representing the state space with one more variable to show how HEXQ auto-
matically adapts to exploit any new constraints,
• forcing an order on some of the variables to demonstrate the robustness of
HEXQ,
• making the passenger “fickle” to show that HEXQ finds a different decompo-
sition to avoid an otherwise poorer solution and
• including fuel to demonstrate how HEXQ manages the hierarchical credit as-
signment problem.
The other domains covered are: a 25 rooms problem to test for region boundaries
without higher level variables changing value; the Towers of Hanoi puzzle with 7
discs demonstrating HEXQ building a hierarchy with 7 levels; and a ball kicking
task where HEXQ saves 5 orders of magnitude in storage space over a flat1 learner
making the learning task feasible.
6.1 The Taxi Domain
In the taxi domain shown in figure 6.1, a taxi started at a random location navigates
around a 5-by-5 grid world to pick up and then put down a passenger. There are four
possible source and destination locations, designated R, G, Y and B encoded 1, 2, 3,
4 respectively. The objective of the taxi agent is to go to the source location where
the passenger is waiting, pick up the passenger, and navigate with the passenger
in the taxi to the destination location. Once there the agent has to put down the
passenger to complete its mission. The source and destination locations are also
chosen at random for each new trial. At each step the taxi can perform one of six
primitive actions. There are four navigation actions: move one square to the North,
South, East or West, one action to pickup the passenger and one action to putdown
the passenger. A move into a wall or barrier leaves the taxi location unchanged. For
a successful passenger delivery the reward is 20. If the taxi executes a pickup action
at a location without the passenger or a putdown action anywhere other than at the
destination location it receives a reward of -10. For all other transitions the reward
is -1. The trial terminates following a successful delivery. The navigation actions
are stochastic, in the sense that, with an 80% probability the action is performed as
intended and with a 20% probability the agent attempts to move either to the left
or to the right of the intended direction with equal chance.
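The navigation noise can be made concrete with a short sampler. The snippet below only illustrates the 80/10/10 slip model described above; the direction encoding is a hypothetical one.

import random

# clockwise order, so "left of" and "right of" are simple index shifts
DIRECTIONS = ["North", "East", "South", "West"]

def attempted_direction(intended, rng=random):
    # Return the direction the taxi actually attempts to move in: the intended
    # direction with probability 0.8, otherwise a slip to the left or to the
    # right of the intended direction with probability 0.1 each.
    i = DIRECTIONS.index(intended)
    r = rng.random()
    if r < 0.8:
        return intended
    if r < 0.9:
        return DIRECTIONS[(i - 1) % 4]           # slip to the left of the intended direction
    return DIRECTIONS[(i + 1) % 4]               # slip to the right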
1. To distinguish a normal MDP from a hierarchically decomposed structure the former is referred to as flat.
Figure 6.1: The Taxi Domain.
6.1.1 Automatic Decomposition of the Taxi Domain
Testing HEXQ on this domain provides an example of how HEXQ decomposes a
3-dimensional problem forming a three level hierarchy. The HEXQ algorithm from
Chapter 5 is reproduced here for convenience and will be explained step by step.
The taxi problem can be formulated as an episodic MDP with the 3 state vari-
ables: the location of the taxi (values 0-24), the passenger location including being
in the taxi (values 0-4, 0 means in the taxi) and the destination location (values
1-4). It is easy to see that for the taxi to navigate to one of the source or destina-
tion locations, the navigation policy can be the same whether it intends to pick up
or put down the passenger. The usual flat formulation of the MDP will solve the
navigation subtask as many times as it reoccurs in the different contexts.
Dietterich (2000) shows how, by designing a MAXQ hierarchy, the problem is
solved more efficiently with subtask reuse. The problem will now be automatically
decomposed and solved with HEXQ. Progress through the HEXQ algorithm will be
traced in detail.
The first step in the HEXQ algorithm (table 6.1) is to order the variables by
frequency of change. For the taxi example, table 6.2 shows the frequency that each
Table 6.1: The HEXQ algorithm
function HEXQ( MDP〈states = X, actions = A〉 )
1. X ← sequence of variables 〈X1, X2, . . . , Xd〉 sorted by frequency of change
2. S1 ← X1
3. A1 ← A
4. for level e ← 1 to d− 1
5. explore Se, Ae transitions at random to find T^a_ss′, Exits(Se), Entries(Se)
6. MERe ← Regions(Se, Ae, T ass′ , Exits(Se), Entries(Se))
7. construct sub-MDPs from MERe using Exits(Se)
8. for all sub-MDPs m
9. E∗m(se, ae) ← ValueIteration(sub-MDP m, γ = 1)
10. Ae+1 = ∪i∈MEReExits(i)
11. Se+1 = MERe ×Xe+1
12. Execute(level d, initial state, top level sub-MDP)
end HEXQ
variable changed value during a 2000 random action exploration run. The duration
of 2000 steps is specified by the user. The taxi location variable changes value more
frequently than the others. The passenger location variable only changes value
when the passenger is picked up and the destination variable remains constant for
the entire trial. The order in which the variables are therefore processed to build
the hierarchy is: taxi location, passenger location and finally destination.
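The ordering step itself is simple to express. The sketch below assumes a generic environment object with an action list, a reset method and a step method that returns the next state as a tuple of variable values; it is an illustration of the frequency count, not the thesis code.

import random

def order_variables_by_frequency(env, n_steps=2000, rng=random):
    # Count how often each state variable changes value during a random walk
    # and return the variable indices sorted from most to least frequently
    # changing, together with the raw counts.
    state = env.reset()
    counts = [0] * len(state)
    for _ in range(n_steps):
        action = rng.choice(env.actions)
        next_state = env.step(action)
        for i, (old, new) in enumerate(zip(state, next_state)):
            if old != new:
                counts[i] += 1
        state = next_state
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    return order, counts

For the counts in table 6.2 this returns the taxi location first, then the passenger location, then the destination.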
Having selected the taxi location as the first variable, the algorithm performs
Table 6.2: Frequency of change for taxi domain variables over 2000 random steps.
Variable Frequency Order
Passenger location 4 2
Taxi location 846 1
Destination 0 3
the decomposition of the state space into MERs in line 6 of table 6.1. The state
space is described by (s1, y) where s1 = Taxi location and y = Passenger location × Destination. To find the directed graph representing transitions between values of the taxi location, the taxi agent again performs random actions until all primitive actions have been taken a predefined number of times (180 in this case) for each state variable value. This number is user defined for this problem and is selected in such a way that
all exits are consistently found for up to 100 runs of the experiment. The results of
random trials for ordering the variables could be reused for finding regions, but this
efficiency has not been implemented.
For each state s1 and primitive action the frequency of the next s1 state variable
values and the rewards received are recorded. These statistics have two uses. In
this decomposition step they are used to decide whether there are any non-Markov
transition exits by using the binomial and K-S tests as described in section 5.2.1.
Later they are used to solve the sub-MDPs using value iteration. For the taxi
example the directed graph for the taxi location variable is shown in Figure 6.2. Exit
transitions are not counted as edges. For example, taking action pickup or putdown
in state s1 = 23 may change the passenger location variable or reach the goal,
respectively. This means that this transition may change the y variable. A change
in the y variable value indicates an exit. However, it is possible to probabilistically
predict the transitions anywhere else in the environment. From state 7, taking
action North for example, transitions to state 2, 80% of the time, 10% of the time
to state 8 and 10% to state 7. Only one directed edge is drawn to represent multiple
transitions between the same states in Figure 6.2 to avoid cluttering the diagram.
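As a rough illustration of how these statistics are used, the fragment below flags a (state, action) pair as an exit if it was ever observed to change the higher level context y, or if its next-state frequencies vary noticeably across contexts. The thesis uses binomial and Kolmogorov-Smirnov tests for the second check (section 5.2.1); the fixed threshold here is only a stand-in for those tests.

from collections import Counter

def looks_like_exit(samples, threshold=0.1):
    # samples: list of (y_context, next_s1, y_changed) observations collected
    # for a single (s1, action) pair during random exploration.
    if any(y_changed for _, _, y_changed in samples):
        return True                              # the transition can change the context

    # compare the empirical next-state distribution within each context with
    # the pooled distribution; a large deviation suggests the transition is
    # not invariant across contexts
    pooled = Counter(s for _, s, _ in samples)
    total = sum(pooled.values())
    for context in {y for y, _, _ in samples}:
        ctx = [s for y, s, _ in samples if y == context]
        ctx_counts = Counter(ctx)
        deviation = max(abs(ctx_counts[s] / len(ctx) - pooled[s] / total) for s in pooled)
        if deviation > threshold:
            return True
    return False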
All states are entry states because the taxi agent is started at random in any
location. As the whole graph for the taxi location variable is strongly connected,
one MER is formed which meets the condition of a HEXQ partition that all entry
states must reach all exit states without leaving the region. The MER is labelled
0. It has eight exits. They are:
Figure 6.2: The directed graph of projected state transitions for the taxi location. Exits are not counted as edges of the graph.
(s1 = 0, a1 = pickup), (s1 = 0, a1 = putdown),
(s1 = 4, a1 = pickup), (s1 = 4, a1 = putdown),
(s1 = 20, a1 = pickup), (s1 = 20, a1 = putdown),
(s1 = 23, a1 = pickup), (s1 = 23, a1 = putdown).
Line 7 of the HEXQ algorithm now creates 4 sub-MDPs as shown in figure 6.3.
Each sub-MDP has a sub-goal to reach one of the exit states. The sub-MDPs are
solved using value iteration with the transition model (transition probabilities and
rewards) determined from the recorded statistics in line 5 of the HEXQ algorithm.
The solution to each sub-MDP m determines the optimum HEXQ action value func-
tion E∗m. As this is level 1, these values are identical to the Q action value function.
The value E∗m(s1, a1) is the expected maximum reward that can be accumulated on
the way to the exit state for sub-MDP m by starting in state s1, taking action a1
next and then continuing with the optimum policy.
Figure 6.3: The four level 1 sub-MDPs for the taxi domain, one for each hierarchical exit state, constructed by HEXQ.
Step 10 of the HEXQ algorithm generates abstract actions for the next level by
taking the union of all exits over all MERs. Abstract actions are encoded in the
same way as exits. The 8 abstract actions are the 8 exits listed above.
Line 11 of the HEXQ algorithm generates abstract state variable values for the
next level. Applying the method in equation 5.5 simply generates variable values
such that s2 = x2 as there is only one MER labelled 0. In other words, the abstract
state variable values at level 2 are the same as the values of the next most frequently
changing state variable, the passenger location. Therefore there are 5 abstract states
and 8 abstract actions at level 2.
At this stage HEXQ loops back to step 5 to decompose the level 2 state space.
Figure 6.4 shows the directed graph generated after a random execution of abstract
actions, exploring each action from each state 5 times (user defined). There are 5 s2
abstract state nodes. Each one represents the abstracted MER from level 1. In the
figure one of the level 2 abstract states, state 3, is illustrated with the detail of the
level 1 MER that it represents. Level 2 edges show the effect of the abstract actions.
With the passenger at a particular source location 1, 2, 3 or 4, the only abstract
action that successfully places the passenger in the taxi is the one that navigates
the taxi to passenger source location and performs a pickup. Other abstract actions
leave the passenger location unchanged. The transitions in figure 6.4 that cause a
change to state 0 are labelled with the abstract actions.
With the passenger in the taxi, s2 = 0, there are four abstract actions that cause
exits. These are the ones that navigate to one of the source/destination locations
and perform a putdown primitive action. They are shown in figure 6.4. For example
exit (s2 = 0, (s1 = 23, a1 = putdown)) means: with the passenger in the taxi,
navigate to location s1 = 23 and putdown the passenger. They are exits because
they may cause the MDP to terminate (see exit definition 4.1). The exit notation
at level 2 is (s2, a2) where a2 = (s1, a1).
Figure 6.4: State transitions for the passenger location variable at level 2 in the hierarchy. There are 4 exits at level 2.
This time the directed graph is not strongly connected. The degenerate 5 SCCs
found by the algorithm are the individual states. However, because the problem is
never started with the passenger in the taxi, state s2 = 0 is not an entry state. This
means that the SCCs are merged into one MER which complies with the reachability
condition of a HEXQ partition that all entry states can reach all exit states.
The hierarchical exit states for this MER are (s2 = 0, s1 = 0), (s2 = 0, s1 = 4),
(s2 = 0, s1 = 20), (s2 = 0, s1 = 23). From subsection 5.3, because the hierarchical
exit states are all different for the four exits at level 2, it is necessary to create 4
sub-MDPs at level 2.
Four abstract actions corresponding to the policies of the four sub-MDPs (one
for each exit) are created for level 3. The abstract states s3 generated for level 3
correspond to the destination states as there is only one MER at level 2. The HEXQ
algorithm now drops to step 11 to solve the top level sub-MDP. This sub-MDP is
trivial to solve as each abstract action either directly solves the problem or leaves
the agent in the same state, that is, with the passenger still to be delivered. The
directed graph for level 3 is shown in figure 6.5. Note the nesting in the description
of abstract actions.
The resultant HEXQ graph for the decomposed MDP is shown in figure 6.6.
There are 9 sub-MDPs in total in the hierarchy. To illustrate the execution of a
competent taxi agent on the hierarchically decomposed problem, assume that the
taxi is initially located randomly at cell 5, the passenger is on rank 4 and wants to
go to rank 3.
In the top level sub-MDP, the taxi agent perceives the passenger destination as
3 and takes abstract action a3 = (s2 = 0, a2 = (s1 = 20, a1 = putdown)). This sets
the subgoal state at level 2 to s2 = 0 or in English, pick up the passenger first. At
level 2, the taxi agent perceives the passenger location as 4, and therefore executes
abstract action (s1 = 23, a1 = pickup). This abstract action sets the subgoal state
at level 1 to taxi location s1 = 23 i.e. location 4. The level 1 policy is now executed
using primitive actions to move the taxi from location s1 = 5 to the pickup location
Figure 6.5: The top level sub-MDP for the taxi domain showing the abstract actions leading to the goal.
s1 = 23 and the pickup action is executed on exit. Level 1 returns control to level 2
where the state has transitioned to s2 = 0. Level 2 now completes its instruction and
takes abstract action (s1 = 20, a1 = putdown). This again invokes level 1 primitive
actions to move the taxi from location s1 = 23 to s1 = 20 and then putdown to exit.
Control is returned back up the hierarchy and the trial ends with the passenger
delivered correctly.
The Taxi example illustrates how HEXQ automatically decomposes a 3-dimensional
MDP without the benefit of a prior model. The construction of level 2 in the hi-
erarchy demonstrates the merging of five SCCs to form one MER. While there was
only one exit state at the same level (s2 = 0), 4 sub-MDPs are required as each
hierarchical exit state is unique. In total 9 sub-MDPs are generated to construct
the HEXQ hierarchy.
Figure 6.6: The HEXQ graph showing the hierarchical structure automatically generated by the HEXQ algorithm: one level 3 sub-MDP invoking 4 abstract actions, 4 level 2 sub-MDPs invoking 8 abstract actions, and 4 level 1 sub-MDPs using the 6 primitive actions.
6.1.2 Taxi Performance
It is possible to demonstrate performance improvements over a flat reinforcement
learner for the simple taxi task both in terms of the number of primitive time steps
required to reach optimality and in the size of the table required to store the value
function.
For the experiments, the stochastic taxi task described previously is used. The
initial value functions, Q for the flat MDP, C for MAXQ and E for HEXQ are ini-
tialised to zero. The learning rate for all algorithms was set to a constant 0.25 (see footnote 2). Each
algorithm is allowed to choose actions greedily and relies on the natural stochasticity
and properties of the domain to explore all the states. HEXQ is run as described
above, automatically constructing the HEXQ graph. To compare it to MAXQ (footnote 3) with
2. The effectiveness of convergence with this learning rate is examined in appendix B.
3. The implementation of MAXQ used for these experiments is that interpreted and programmed by the author.
a user provided decomposition, HEXQ is also run with a previously found HEXQ
graph, thereby excluding the hierarchy construction load from the performance.
Simple one-step value backups are used to learn the value function for HEXQ,
MAXQ and the flat reinforcement learner. When HEXQ includes construction it
uses action-value iteration to find the optimal value function of the sub-MDPs after
it has learnt the model by statistically sampling transitions as described in Chapter
5.
Figure 6.7: Performance of HEXQ with and without hierarchy construction against MAXQ and a flat MDP for the stochastic taxi (average reward per time step plotted against time steps; curves for HEXQ including construction, HEXQ excluding construction, MAXQ and the flat learner).
The results are shown in figure 6.7. The average primitive reward received per
primitive time step is plotted against the number of primitive time steps elapsed.
A trial ends whenever the taxi agent delivers the passenger. After each trial, the
taxi agent is restarted again with random variable values and the learning contin-
ues. Each 100, 000 primitive time steps constitute a run. After each run the whole
problem is reset to the state before hierarchy construction and learning takes place.
The graph shows the performance averaged over 100 runs with the average reward
per time step averaged over 100 time steps.
Comparing “HEXQ with automatic hierarchy construction” against the flat learner,
the graph shows that HEXQ converges to the optimum policy in about half the num-
ber of time steps. The flat learner is able to start improving its policy right from
the start while HEXQ must first order the variables and find level 1 exits before it
can start to improve its performance. More recently, Potts and Hengst (2004) have
shown how learning can be speeded up by constructing regions at multiple levels
concurrently. Once the navigation MER is discovered, HEXQ learns very rapidly as
it transfers knowledge and reuses the optimal navigation policies in other contexts.
It is of course possible to improve both HEXQ and the flat learner using multi-step
backups or prioritised sweeping (Moore and Atkeson, 1993). The flat learner is
expected to benefit most from these techniques, making the case for hierarchical de-
composition less convincing for smaller problems. Any flat learner will eventually be
defeated by an exponential growth in the state space as the problem size increases,
which is not the case when problems can be efficiently decomposed.
As MAXQ is a priori provided with the hierarchical decomposition, for compari-
son HEXQ is tested with a decomposition found in a previous run. Not surprisingly,
as can be seen in figure 6.7, both “HEXQ excluding the hierarchical construction”
and MAXQ outperform the flat learner and “HEXQ including construction”. MAXQ
outperforms HEXQ. This is explained by the additional background knowledge pro-
vided to MAXQ in that pickup and putdown actions do not have to be explored
for the navigation task. HEXQ does not make this assumption and early average
reward per step is worse as it tests pickup and putdown actions in all possible taxi
locations.
The early dip in performance in the graph for “HEXQ excluding the hierarchical
construction” is caused by simultaneously learning the value function at multiple
levels. Lower level sub-MDPs initially have inflated costs to reach exits before
achieving optimal policies. These inflated values are incorporated in higher level
value functions and cause sub-optimal behaviour in the process of being corrected.

Table 6.3: Storage requirements in terms of the number of table entries for the value function for the flat MDP, MAXQ, HEXQ, HEXQ with stochastic actions after eliminating no-effect actions and HEXQ with deterministic actions after eliminating no-effect actions.

Level   Flat   MAXQ   HEXQ   HEXQ (stochastic actions,       HEXQ (deterministic actions,
                              no-effect actions eliminated)   no-effect actions eliminated)
3                      16     16                              16
2                      160    160                             160
1                      600    396                             272
Total   3000   632     776    572                             448
Storage requirements for the value function are shown in table 6.3. A flat learner
uses a table of 500 states and 6 actions = 3000 values. HEXQ requires 4 MDPs
at level 1 with 25 states and 6 actions = 600 values; 4 MDPs at level 2 with 5
states and 8 actions = 160 values; 1 MDP at level 3 with 4 states and 4 actions =
16 values - a total of 776 values. MAXQ by comparison requires only 632 values
(Dietterich, 2000). Interestingly, if a HEXQ graph is generated with the no-effect
actions efficiency improvement in subsection 5.7 then only 572 E values are required
to represent the complete decomposed value function with stochastic actions and 448
E values with deterministic actions as all pickup and putdown actions have no effect
at level 1.
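The table-entry counts quoted above are easy to verify mechanically; the lines below simply reproduce the arithmetic for the flat learner and for HEXQ without the no-effect optimisation.

flat = 500 * 6                                      # 500 states x 6 primitive actions
hexq = 4 * (25 * 6) + 4 * (5 * 8) + 1 * (4 * 4)     # level 1 + level 2 + level 3 sub-MDPs
assert (flat, hexq) == (3000, 776)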
This simple taxi domain has empirically supported the theory in Chapter 4 with
a reduction in computational complexity for HEXQ compared to a flat learner. The
automatically constructed reinforcement learning hierarchy performs almost as well
as the hand coded MAXQ hierarchy when compared on an equal footing. If no-
effect actions are automatically eliminated, it is conjectured that HEXQ excluding
construction would at least equal MAXQ in performance.
6.1.3 Taxi with Four State Variables
For a better understanding of the HEXQ algorithm it is instructive to experiment
further with the taxi domain by defining an additional variable and experimenting
with the order of variable processing.
The taxi problem characteristics are left unchanged to those in the previous
section, except that the representation of the taxi location is now given by two
variables that represent the x and y coordinates in its 5 × 5 grid world illustrated
in figure 6.9 (a). For example, the taxi in the figure is shown in location (3, 1).
This means that the state is represented by the four variables: passenger location,
destination location, taxi x location and taxi y location.
Running the HEXQ algorithm with this state description, the taxi y location
variable is found to change more frequently than the taxi x location. The reason is
that the internal barriers in the grid world prevent the taxi from moving in the x
direction on some random instances. The y variable which can take on 5 different
values is therefore selected to construct level 1 in the HEXQ graph. This time any
navigation action from any taxi y location may change the taxi x location. Because
of the stochastic nature of the transitions, pickup and putdown actions do not cause
transitions between taxi y locations. Consequently all five taxi y location states are
separate MERs. Three of the MERs have 4 exits each and the other two MERs
which are the source and destination locations have 6 exits each. The sub-MDPs at
level 1 each have one state and their optimal policies are simply their exit actions.
Equation 5.5 will generate 25 states for variable s2 from both the next variable,
the taxi x location in the ordering and the 5 MERs from level 1. Because the
number of exits vary for the MERs, the actions available in each s2 state will vary.
Essentially the pickup and putdown actions will not be explored for any taxi x
location when the taxi y location is either 1,2 or 3 as they have already been ruled
out by the optimal policies at level 1.
HEXQ will now find one 25 state MER in the s2 abstract state space with 8 exits
Figure 6.8: Performance of the stochastic taxi with 4 variables compared to the three variable representation (average reward per time step plotted against time steps).
and complete the hierarchy construction as in the previous case. This is an example
where the HEXQ partition did not result in a useful state abstraction opportunity.
The performance between HEXQ with three and four variables is shown in figure
6.8. The major difference in performance is that the average reward per step is
reduced while exploring level 2 as pickup and putdown actions have already been
eliminated for certain y variable labels at level 1 and are avoided.
If instead of stochastic actions all actions are deterministic, then HEXQ will find
one MER with 14 exits for the taxi y location variable at level 1. The graph for
this MER is shown in figure 6.9 (b). This time the effect of the primitive actions
of moving North and South is independent of the taxi x location and the MER
generalizes this result accordingly. At the second level in the HEXQ hierarchy one
MER consisting of 5 s2 states and 8 exits is found. This models navigation in the
x direction using the abstract actions of locating the taxi at a particular y location
Figure 6.9: Taxi domain with four variables, (a) x and y coordinates for the taxi location, (b) the y variable MER for deterministic actions, (c) the three MERs for deterministic actions when the x variable is forced to level 1.
first and then moving East or West4. HEXQ explores all 10 navigation exits at level
2. The rest of the hierarchy is again constructed as before.
In the final experiment with the four variable representation of the taxi domain,
the taxi x location and taxi y location variables are forced to be processed by the
HEXQ algorithm in reverse order. As discussed in the HEXQ theory this does not
effect the operation of the algorithm, other than in the efficiency of the decompo-
sition. With the taxi x location used at level 1, HEXQ finds four MERs shown in
figure 6.9 (c). As can be seen in the diagram, the MERs from left to right, have 5, 4,
4 and 9 exits respectively. The only MER with more than one state is the rightmost
one representing the taxi x values 3 and 4. As predicted by the theory, the problem
is decomposed successfully, but less efficiently.
Even with the better order of variables, if the criterion for combining levels from
4. Tom Dietterich, in private communications, has pointed out that a designer could restrict the termination of the y region East and West from value 2 alone for the navigation sub-task in a MAXQ graph and still achieve an optimum solution.
section 5.7 is used, ScombinedLevel = 150 < StwoLevels = 220 and hence HEXQ
would be better off combining these variables in this situation largely because of the
large number of exits generated as seen in figure 6.9. With stochastic actions the
MERs are single states and the case for combination is even stronger. All the above
variations for this problem lead to a globally optimal solution.
The taxi domain with 4 variables has shown that HEXQ is robust in the face of alternative problem descriptions and enforced variable orderings, and that the efficiency of the decomposition will vary as anticipated.
6.1.4 Fickle Taxi
Dietterich designed the fickle taxi task to show that a MAXQ recursively optimal
policy is not hierarchically optimal.
The destination location in the original taxi domain is chosen at random at the
start of the problem and remains unchanged during the trial. In this experiment
the problem is complicated by the passenger stochastically changing his or her mind
about the destination during each trial. A fickle taxi is specified with the passenger
changing the destination location with a probability of 0.3 on pickup. The MDP
again uses the three variables: passenger location, destination location and taxi
location.
If the HEXQ graph from the original taxi domain is used with the fickle taxi, the
hierarchical constraints produce a recursively optimal policy that is not as good as
the hierarchically optimal (and globally optimal in this case) policy. The abstract
actions learnt by HEXQ for the top level sub-MDP commit the taxi to put down
the passenger at the original destination location (see figure 6.5). For hierarchical
execution the taxi will complete this abstract action even though the destination
may have changed during the trial.
HEXQ is proven to be hierarchically optimal. Is this therefore a contradiction?
The answer is no. If HEXQ is allowed to rerun with the fickle taxi specification,
Figure 6.10: Performance of the taxi with a fickle passenger compared to the original decisive passenger (average reward per time step plotted against time steps, showing the optimal policy, the fickle taxi using the HEXQ graph constructed for the non-fickle taxi, the fickle taxi using the HEXQ graph constructed for the fickle taxi, and a flat MDP for reference). HEXQ results do not include the time-steps required for construction of the HEXQ graph.
then HEXQ builds a different hierarchical structure which is again hierarchically
optimal and in this case globally optimal. Figure 6.10 shows the average reward
per primitive time step over 100 runs for the fickle taxi for (1) the flat MDP, (2)
using the original HEXQ graph of subsection 6.1.1 and (3) using the HEXQ graph
constructed with the fickle specification. In the latter case, at level 2 in the HEXQ
graph, HEXQ finds 5 MERs in contrast to one MER for the original HEXQ graph
(see figure 6.4). This is because a pickup action may change the destination location
creating an exit. The fickle HEXQ graph passes control back up to the top sub-MDP
on a pickup action which provides the opportunity for the taxi to choose to go to
the changed destination.
Dietterich (2000) specifies a different fickle taxi to the one used here in that the
destination location is changed after the passenger has moved one square away from
the passenger’s source location. This fickle taxi specification is not Markov in terms
of the original representation, as the stochastic change in destination is dependent
on a previous state, namely the source location of the passenger. HEXQ is based
on the assumption that the original problem is an MDP. It is for this reason the
above example does not use Dietterich’s fickle taxi specification. Nevertheless, using
Dietterich’s specification for the fickle taxi, HEXQ will also construct a different
hierarchy and perform with similar results and conclusions to those above. There
are then, however, small differences in the final converged optimal reward per time
step values between the flat learner and HEXQ which cannot be explained in a
Markov setting.
The original intention was to repeat, albeit with automatic decomposition, the
sub-optimal results obtained with MAXQ. The optimal solution produced by HEXQ
was at first unexpected. The surprise in this experiment was that HEXQ constructed
a different hierarchy for the fickle passenger variation of the problem, although in
retrospect this is easily explainable.
6.1.5 Taxi with Fuel
Another instructive example is the introduction of fuel to the taxi domain. The
objective here is to show how HEXQ solves the “hierarchical credit assignment”
problem (Dietterich, 2000) implicitly.
Figure 6.11: The Taxi with Fuel.
The taxi problem from section 6.1 is modified with the inclusion of a fuel tank.
The taxi consumes one unit of fuel for each primitive navigation action it takes.
If the taxi runs out of fuel before reaching the destination location and delivering
the passenger the taxi is refuelled with a penalty reward of -20. There is a refilling
station as shown in figure 6.11. A refilling action is added making the total number
of actions seven. When the refilling action is executed at the refilling station, the taxi
is refuelled to a fuel level of 12 and the usual reward of −1 is incurred. If the refilling
action is taken other than at the refilling station a reward of −10 is incurred with
no effect on the fuel level. The taxi is fuelled stochastically to a random fuel level
between 5 and 12 units of fuel inclusive prior to starting each trial. The introduction
of fuel requires a fuel level variable with the 13 values 0, 1, 2, . . . , 12.
Dietterich introduced this variation to the taxi problem to illustrate the hierar-
chical credit assignment problem in MAXQ. In MAXQ a Refuel subtask is added
to the hierarchy by the designer with the objective of moving the taxi to the refill-
ing station and refuelling. Unfortunately, the navigation subtasks in MAXQ which
move the taxi to the source and destination locations do not know how much fuel is
in the tank. The navigation sub-problem is non-Markov in that it is not possible to
predict from the taxi location alone when the taxi will run out of fuel and incur a
reward of −20. The solution in MAXQ is to manually separate the rewards received
for navigation from those received for exhausting fuel and assigning the latter to the
root task in the MAXQ hierarchy so that it can learn whether to invoke the Refuel
subtask. The problem of decomposing the reward and assigning the components
to different sub-tasks is referred to as the hierarchical credit assignment problem
(Dietterich, 2000).
The HEXQ solution is very different and avoids the hierarchical credit assignment
problem. Because HEXQ creates sub-MDPs the issue of a subtask not being Markov
never arises. HEXQ finds that the fuel level variable changes most frequently and
therefore chooses this for level 1 in the HEXQ graph. As the taxi location variable
may change with every fuel level change, each of the 13 fuel levels becomes a separate
MER. The 7 primitive actions are exit actions for each of these MERs, other than
for fuel level 12 which has 6 exits as the refilling action has no effect under any
circumstances. There is very little benefit having each fuel level state as a separate
MER. It would be more efficient to combine the fuel variable with the taxi location,
the next most frequently changing variable (see subsection 5.7). HEXQ at level 2
effectively takes the Cartesian product of the fuel levels and the taxi location to
create 13 × 25 = 325 s2 states. At level 2 HEXQ discovers 1 MER with 104 exits.
The exits are generated when the passenger is picked up or delivered at the 4 source
or destination locations with any of 13 levels of fuel (2× 4× 13 = 104). Level 3 in
the HEXQ graph has 5 s3 states representing the location of the passenger. They
form one MER with 52 exits to drop off the passenger at the various destinations
and with various fuel levels remaining in the tank. The top level represents the
destination location variable as before.
Figure 6.12 shows the taxi location × fuel level MER with some example transi-
tions identified. The set of sub-MDPs for this MER learns how to navigate around
the taxi world and simultaneously control the amount of fuel in the tank. The
top level sub-MDPs can then learn the most cost effective combination of the two
abstract actions of fuel efficiently picking up and dropping off the passenger.
Figure 6.13 compares the performance of HEXQ for the taxi with fuel and a flat
reinforcement learner. In this experiment the state variables are taxi location × fuel level and passenger location × destination location. The navigation actions are
stochastic as in the original taxi example. The learning rate is set to 0.1 which, as
appendix A shows, allows the flat reinforcement learner to more closely approach
the optimal solution. A crude ε-greedy exploration policy is used with both HEXQ
and the flat learner. In HEXQ exploration is only required at the top level as the
other sub-MDPs are solved using value iteration after collecting transition statistics.
The exploration rate ε is reduced from 0.3 to 0.1 and finally to 0.0. The exploration
Figure 6.12: The MER in the taxi with fuel problem showing the taxi location × fuel level state space and some example transitions (the refuel action, navigation actions, and pickup and putdown exit actions).
rates are dropped at slightly different times for each of the learners simply to help
distinguish the two graphs. To sort the variables in HEXQ the number of random
primitive time steps is set to 5000. At level 1 the number of times each state-action
combination is explored is set to 50 which ensures that all exits are discovered most
of the time. The reward per time step is averaged over each 1000 time steps for 100
runs.
After finding the fuel-navigation MER, HEXQ is able to improve performance
more quickly than the flat learner as it transfers the learnt policies for this sub-task
to other combinations of passenger and destination locations. There are 104 exits at
level 1 generating 52 sub-MDPs one for each hierarchical exit state. HEXQ therefore
requires an E table with 120,380 entries in contrast to the flat learner which requires only 45,500 Q values.
For this example it is not critical that HEXQ discovers all the exits. If exits
Figure 6.13: Performance of the taxi with fuel averaged over 100 runs using stochastic actions and a two variable state vector (average reward per time step plotted against primitive time steps; the exploration rate ε is reduced from 0.3 to 0.1 to 0.0). HEXQ attains optimal performance after it is switched to hierarchical greedy execution.
are missed for some fuel levels then the taxi agent will “work around” the problem
by finding a suitable alternative exit at a small cost in performance. This usually
means the taxi will burn fuel on the spot until the next exit is reached. Fewer exits also mean that less storage is required for the E tables. There is a tradeoff between performance and resource usage.
When the navigation actions are stochastic, HEXQ, with hierarchical execution,
achieves a hierarchically optimal policy but cannot reach the optimal performance
of the flat learner. Stochastic navigation actions mean that it is more difficult to
reach a particular exit state which is defined in part by the amount of fuel in the
tank. It is easy for the taxi to randomly slip away from the target exit state and
then not be able to reach it as the fuel level can only decrease in most situations.
To improve this situation HEXQ is switched to hierarchical greedy execution mode,
as discussed in section 4.3, after learning. In this mode, instead of persisting with
one particular exit, the position is reevaluated after every primitive step and the
optimal action chosen using the decomposed value function. As shown in figure
6.13 the performance with hierarchical greedy execution is at least as good as that
attained by the flat learner.
From a designer’s perspective it would be desirable to abstract away the fuel level
from the navigation task. Without a model and without the benefit of background
knowledge, the taxi agent needs to learn the relationship between fuel usage and
navigation. This has been achieved successfully by HEXQ. Ryan (2002) tackles the
same issue, that of uncovering the cause of the side effect of running out of fuel when
navigating.
The main purpose of this example was to demonstrate that HEXQ avoids the
hierarchical credit assignment problem. Because HEXQ partitions the state space
into smaller MDPs each sub-MDP will, by definition, only model a Markov reward
function.
6.2 Twenty Five Rooms
The twenty five rooms example illustrates the operation of the Markov transition
condition of a HEXQ partition, that is, the discovery of exits when higher level
variables do not change value. In the first example deterministic actions are used
and MERs can be easily visualized. The second example shows the detection of an
internal room barrier modelled as either a hard border or as a high cost threshold.
The twenty five room problem consists of 25 interconnected rooms as shown in
figure 6.14. It is essentially a larger version of the rooms example used throughout
this thesis to illustrate basic concepts. The two variables describing the state are
position-in-room and room-number. The encoding assumes that each room is a
5 × 5 square of cells and the 25 rooms are also arranged in a 5 × 5 pattern. In
the first version, as can be seen in the figure 6.14, the actual rooms vary in size
with some of the walls missing or moved. Two of the rooms are divided by diagonal
walls. Despite these distortions in rooms, the labelling of the states is unchanged.
The numbers in this figure are the position-in-room value. The room values are
not shown. Rooms are labelled 0-24 left to right, top to bottom. An agent is
located at random anywhere in this environment and is required to move to a fixed
goal location somewhere else in the environment, in this case room-number 18 and
position-in-room 24. The deterministic navigation actions are North, South, East
and West and incur a reward of −1 on each step.
The position-in-room variable changes more frequently than the room-number
variable. Consequently the position-in-room is used at level 1 in the construction
of the HEXQ graph. HEXQ discovers the four MERs shown on the right hand side
in figure 6.14. It is easy to visualise and verify from the diagram that transitions
internal to each MER are identical for each of their aliased instances in the envi-
ronment. There are some exit transitions for which the room-number variable does
not change. An example is the transition from position-in-room 20 to 21. For this
transition the room-number does not change, but the transition is not Markov as
there are two instances near the bottom right of the environment where the tran-
sition is blocked by a wall. HEXQ discovers these exits by testing for a stationary
probability distribution as described under Same Context Condition in section 5.2.
The top level sub-MDP will have 100 states generated from the 4 MERs at level 1
and the 25 room-number variable values.
In the second version, shown in figure 6.15, there are two changes. The actions
are now stochastic. With 80% probability the agent will move as intended and 20% of
the time it will not move at all. Secondly, the rooms are all uniformly interconnected,
except that in one room (the one containing the goal location) a one way barrier is
inserted horizontally as shown. The barrier is specified in two separate ways: (1) a
one way wall is constructed that prevents the agent from moving South through the
barrier, and (2) the cost for making a transition across this line is increased from 1
to 100.
Figure 6.14: HEXQ partitioning of an MDP with 25 Rooms. The numbers indicate the values for the position-in-room variable.
The numbers for each location in figure 6.15 indicate the final global optimum
value of each location as found by function value (table 5.4) from the decomposed
value function. The one-way border and the virtual barrier with high negative re-
wards caused HEXQ to find 5 exits using the binomial test and K-S test respectively
(see sections A.3 and 5.2.1) after executing each transition 5000 times. The success
of both of these tests is very sensitive to the nature of the stochastic actions. So for
example, when the agent slips to the left or right of the intended action, each with
a 10% probability, as for the stochastic taxi, not all the exits are discovered even
after exploring each transition 100,000 times.
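As a rough illustration of the kind of stationarity check involved, the sketch below compares reward samples for a single transition gathered in two temporally separated windows using a two-sample Kolmogorov-Smirnov test. It assumes scipy is available; the window sizes, significance level and helper name are illustrative and are not the exact procedures of sections A.3 and 5.2.1.

```python
# A hedged sketch of a non-stationarity check for one (state, action) transition:
# if the reward distribution differs between two sample windows, the transition
# is flagged as a candidate exit.
from scipy.stats import ks_2samp

def looks_like_exit(early_rewards, late_rewards, significance=0.01):
    """Flag a transition as a candidate exit if its reward samples appear to
    come from different distributions in the two windows (non-stationarity)."""
    if len(early_rewards) < 30 or len(late_rewards) < 30:
        return False                          # not enough evidence either way
    statistic, p_value = ks_2samp(early_rewards, late_rewards)
    return p_value < significance

# Example: a transition whose reward jumps from -1 to -100 in some contexts
# (the virtual barrier) should be flagged, while an unchanged one should not.
print(looks_like_exit([-1.0] * 50, [-100.0] * 50))   # True
print(looks_like_exit([-1.0] * 50, [-1.0] * 50))     # False
```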
There is a combination of factors that prevent the tests from finding exits reliably.
Even when transitions have a low probability, the states are connected by an edge
for the purposes of finding strongly connected components. If these low probability
edges are exits, considerable random exploration is required to build the necessary
sample size to conduct the tests to the level of significance specified. This can be
overcome by increasing the amount of exploration. A major impediment, however,
is that the characteristics of the problem may militate against collecting the right
statistics. Recall that these non-stationarity tests are based on samples that are
temporally close. The assumption is that the agent is likely to experience transitions
from one type of probability distribution in a concentrated set of experiences in time.
If the environment characteristics force the agent to other parts of the state space,
it may not be able to gather enough samples of a certain type of transition for the
tests to find exits. Relaxing the confidence levels of the tests would find too many
exits and the problem would not decompose efficiently.

Figure 6.15: The optimal value for each state after HEXQ has discovered the one way barrier constructed in the room containing the goal. The barrier was constructed in two separate ways: (1) by using a border that only allows transitions North, and (2) a virtual barrier by imposing a reward of -100 for transitioning South. Both barriers produced similar value functions.
A better approach may be to conduct the tests in the context of the higher
level variables for which the samples are known to come from only one stationary
distribution because the overall problem is assumed to be Markov. The number of
these contexts can be large, but it may be possible to select specific contexts that
improve the chances of finding exits. For example, if the agent becomes trapped,
it is a clear indication that an exit has been missed. The context that should be
tested is then the higher level variables defining the trapped situation. Section 9.1
expands on these proposals.
The Twenty Five Rooms problem has illustrated the discovery of exits when
higher level variables do not change value by testing for non-stationary state transi-
tion and reward functions for both deterministic and stochastic actions. The heuris-
tics for gathering samples were found to be inadequate to test for exits in some
stochastic settings. In many problems higher level variables will always change on
lower level region exits and these tests are not required.
6.3 Towers of Hanoi
The Tower of Hanoi problem is a puzzle that involves moving discs between pegs.
So far the Taxi, All Goals Maze and Twenty Five Rooms problems have involved
navigation. The Towers of Hanoi example should dispel any notion that HEXQ
is restricted to problems which are navigation dependent or require a metric state
space as an assumed inductive bias. This problem tests HEXQ with a larger number
of variables and therefore hierarchical levels. It will also demonstrate the potential
linear scaling in space complexity with the number of variables of a decomposed
MDP.
The puzzle is usually introduced with any number of different sized discs and
three pegs. The discs are placed in order of size on the first peg with the smallest
on top. The objective of the puzzle is to relocate the initial disc stack to the third
peg with the restrictions that the discs can only be moved from peg to peg one at
a time and at no time can a larger disc be on top of a smaller one.
To make the puzzle challenging this example will use 7 discs and three pegs. With
7 discs this problem has 3^7 = 2187 possible states. To see this, imagine allocating
the discs to the pegs. Each disc can be placed on any of the three pegs. There are 3^7
ways to distribute the 7 discs. Once the discs have been distributed to the pegs the
constraint that they are ordered by size on each peg uniquely determines a state.
Figure 6.16 (a) shows the initial configuration with all the discs on peg 1. Part (b)
of the figure shows the discs in one of many intermediate states and part (c) is the
goal state with all the discs on peg 3. The minimum number of moves required to
solve the 7 disc puzzle is 2^7 − 1 = 127.
Figure 6.16: The Tower of Hanoi puzzle with seven discs showing (a) the start state, (b) an intermediate state and (c) the goal state.
The Towers of Hanoi puzzle can be represented as a 7-dimensional MDP where
each variable in the state vector refers to one of the discs and the variable value
indicates the peg number on which the disc sits. For example, the three legal states
of the puzzle from left to right in figure 6.16 may be encoded (0, 0, 0, 0, 0, 0, 0),
(2, 2, 1, 1, 1, 0, 2) and (2, 2, 2, 2, 2, 2, 2) respectively, with the variables ordered in as-
cending disc size. The variables could of course be listed in any arbitrary order in
the state vector, as HEXQ will first reorder them using the frequency of change
order heuristic of the algorithm. The actions of the MDP are defined in terms of
the from and to peg positions that a disc move is attempted. The six deterministic
actions are all of the form “move (top) disc on peg x to peg y”, where x ≠ y and
x, y ∈ {0, 1, 2} and written movexy. For example, a next valid move from state (b)
in figure 6.16 may be to move the disc on peg 2 to peg 0 or move20. If an action
attempts to relocate a disc to a peg which contains a smaller size disc or the source
peg is empty, the move is illegal and fails with all discs staying in situ. All moves
or attempted moves incur a reward of -1.
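The deterministic 7-disc puzzle just described is small enough to sketch directly. The fragment below is an illustrative Python encoding (the function and variable names are not part of HEXQ) of the state vector and the six move_xy actions, including the rule that illegal moves leave the state unchanged at a cost of -1.

```python
# Illustrative model of the 7-disc Tower of Hanoi MDP: the state is a tuple of
# peg indices ordered by ascending disc size, and an action (x, y) attempts to
# move the top (smallest) disc on peg x to peg y.
from itertools import product

ACTIONS = [(x, y) for x, y in product(range(3), repeat=2) if x != y]  # 6 actions

def step(state, action):
    """Return (next_state, reward) for a deterministic disc-moving action."""
    x, y = action
    on_x = [d for d, peg in enumerate(state) if peg == x]   # discs on source peg
    on_y = [d for d, peg in enumerate(state) if peg == y]   # discs on target peg
    if not on_x:                                             # source peg empty
        return state, -1
    top = min(on_x)                                          # smallest disc on x
    if on_y and min(on_y) < top:                             # smaller disc on y
        return state, -1                                     # illegal, no change
    next_state = list(state)
    next_state[top] = y
    return tuple(next_state), -1

start = (0,) * 7                    # all discs on peg 0
print(step(start, (0, 2)))          # ((2, 0, 0, 0, 0, 0, 0), -1)
```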
In HEXQ the number of initial random actions to sort the variables by order of
frequency is set to 200,000. The number of random actions to explore exits at each
level in the hierarchy is set to 30. HEXQ sorts and then processes the variables
in order of disc size. This is not unexpected as it is always possible to move the
smallest disc and the probability of being able to move a disc decreases with the size
of the disc. So the chance of moving the largest disc by applying random actions is
very small as it usually has other discs on top of it.
Figure 6.17: The directed graph for level 1 in the decomposition of the Tower of Hanoi puzzle showing one MER and six exits.
At each level of the hierarchy below the top level HEXQ will find one MER
and six exits. The directed graph for the first level is shown in figure 6.17. The
nodes represent the state values of the smallest disc, disc 1, and each of the edges is
labelled with the possible actions. For example, if the smallest disc is on peg 0 then
action move02 relocates it to peg 2. Action move12 may change the location of a
larger disc depending on the state of the puzzle. Therefore move12 executed when
the smallest disc is on peg 0 becomes an exit action as it violates the same context
condition of a HEXQ partition. The exits are:
(s1 = 0, a = move12)
(s1 = 0, a = move21)
(s1 = 1, a = move02)
(s1 = 1, a = move20)
(s1 = 2, a = move01)
(s1 = 2, a = move10)
There are six exits with three hierarchical exit states. This means that 3 sub-
MDPs will be generated for level 1 and six abstract actions available at level 2.
The directed graph for level 2 is shown in figure 6.18 with the six abstract
actions labelling the edges. The interpretation is a little more complex this time.
For example, with the second smallest disc, disc 2, on the leftmost peg 0, abstract action
(2, move01) will move the smallest disc to peg 2 and then move the disc on peg 0 to
peg 1. This will always have the effect of moving the second smallest disc from peg
0 to peg 1 because the only disc that could prevent this move, the smallest disc, has
been moved safely out of the way to peg 2. Again with disc 2 on peg 0, abstract
action (2,move10) will move the smallest disc to peg 2 and then attempt to move
the disc on peg 1 to peg 0. This will not succeed as any disc on peg 1 will be larger
than the second smallest disc occupying peg 0. Hence the node transitions to itself
for this abstract action. The abstract action (0, move12) will move the smallest disc
to peg 0. This means that the smallest and second smallest discs are both on peg
0. A disc move from peg 1 to peg 2 may relocate any other disc depending on the
state of the puzzle and hence (s2 = 0, (s1 = 0, a = move12)) is an exit at level 2 in
the HEXQ graph. There are six exits at level 2 shown in figure 6.18 that become
abstract actions at the third level.

Figure 6.18: The directed graph for level 2 in the decomposition of the Tower of Hanoi puzzle showing one MER and six exits.
The pattern repeats itself at each hierarchical level5. For example, to move the
i th largest disc from peg 0 to peg 2, all the smaller sized discs need to be located
out of the way on peg 1. Abstract action (1, (1, . . . (1,move02) . . .)) nested to a
depth of i − 1 will achieve this result. The total number of Q values required by
a flat reinforcement learner for the 7 disc puzzle is 2187 states times 6 actions or
13122 values. HEXQ only requires a total of 342 E values to represent the same value
function in decomposed form. In general, given a Towers of Hanoi puzzle with d
discs the number of states is 3^d and the number of actions 6. A flat reinforcement
learner will require 3^d × 6 Q values. The Q table grows exponentially with the
5An interesting idea would be to introspect after building the hierarchy to discover recursive relationships, as for example are evident between figures 6.17 and 6.18.
number of discs. A HEXQ decomposition in comparison has 3 (abstract) states and
6 (abstract) actions at each level. Three sub-MDPs are required at all levels except
the top one. HEXQ therefore only requires a total of 3 × 6 × 3 × (d − 1) + 3 × 6
E values to represent the value function. For HEXQ the storage requirements grow
linearly with the number of discs. Theorem 4.8 also guarantees that HEXQ will find
an optimal solution for any Towers of Hanoi puzzle.
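The storage comparison can be checked with a few lines of arithmetic; the helper names below are purely illustrative.

```python
# Worked comparison of flat versus decomposed table sizes for d-disc puzzles.
def flat_q_entries(d):
    return 3 ** d * 6                      # 3^d states, 6 primitive actions

def hexq_e_entries(d):
    # 3 abstract states x 6 abstract actions x 3 sub-MDPs at each of the d-1
    # lower levels, plus 3 x 6 entries for the single top-level sub-MDP.
    return 3 * 6 * 3 * (d - 1) + 3 * 6

for d in (3, 7, 10):
    print(d, flat_q_entries(d), hexq_e_entries(d))
# d = 7 gives 13122 flat Q values versus 342 HEXQ E values, as in the text.
```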
Figure 6.19: The performance comparison of a flat reinforcement learner and HEXQ on the Tower of Hanoi puzzle with 7 discs.
The performance of a flat reinforcement learner (one step backup) is compared
to HEXQ for the 7 disc Towers of Hanoi puzzle in figure 6.19. The graph shows
the number of disc moves required to solve the puzzle for successive learning trials.
The results are averaged over 100 runs. The learning rate for both learners is set
to 1. The flat learner requires an exploration strategy to allow it to converge to the
optimum value. An ε-greedy exploration was used with ε set to 30% until trial 400.
Even this did not always guarantee that the flat learner would find the optimum
solution. The HEXQ results include the automatic ordering of the variables and
construction of the hierarchy. HEXQ solves the problem in about 30 trials compared
to 400 for the flat learner. In terms of total primitive actions or disc moves HEXQ
completes the problem in about 14% of the number required by the flat learner. For
the current implementation of HEXQ the number of exploratory random actions
required to sort the variables initially will increase exponentially with the number
of discs as sample movements of the larger discs will become less probable. It is of
course possible to delay the ordering of some of the variables until the lower levels
of the HEXQ hierarchy have been constructed and collect more ordering statistics
for higher level variables at the same time that exits are explored at each level.
Two constraints prevent solving the Tower of Hanoi puzzle with time complexity
that is linear in the number of discs: the number of disc moves needed to effect an
abstract action increases exponentially with level, and determining the value of a
state from the decomposed value function requires a depth first search. The former
is a feature of the mechanics of the puzzle and cannot be improved.
The latter issue was raised in subsection 5.4. With seven levels in the HEXQ
hierarchy the depth first search to find the value of states and the best action to
take becomes noticeably more expensive. One solution is to limit the search to a
specified depth. Intuitively this heuristic makes sense in that one would expect to
capture most of the value of a state at the higher levels in the decomposition where
the greater abstraction exists. This issue will be taken up again in Chapter 8.
The Tower of Hanoi puzzle can be made more challenging with the introduction
of stochastic actions. Disc moving actions perform as intended 80% of the time,
10% of the time the disc is mistakenly replaced on the same peg from which it was
removed and 10% of the time it is mistakenly placed on the other unintended peg. With the
MDP representation defined as above HEXQ will decompose and solve the problem
with similar results and in a similar way to the problem with deterministic actions.
Figure 6.20: The directed graph for level 1 in the decomposition of the stochastic Tower of Hanoi puzzle showing one MER and 12 exits.
One MER is discovered per level, except this time each MER will have 12 exits which
generate 3 sub-MDPs per level. The directed graph with all the edges labelled with
their actions and their probability for level 1 of the HEXQ hierarchy is shown in
figure 6.20. Exit actions are shown as arrows leaving exit states.
This domain demonstrated HEXQ successfully constructing a seven level hier-
archy with three sub-MDPs per level. The space complexity grows linearly with
the number of variables in contrast to a flat learner where the state space grows
exponentially. With seven levels the best first search to find the value of a state and
the next best greedy action becomes prohibitive. It is possible to limit the depth
of the search for this puzzle and still obtain an optimal solution. A future research
direction is to characterise problems that allow good approximations when limiting
the depth of search.
6.4 Ball Kicking - a larger MDP
The final example illustrating the operation of the HEXQ algorithm is an example
inspired by the University of New South Wales’ success over the last five years in the
international RoboCup soccer competition (Hengst et al., 2001). While this stylized
problem of a simulated humanoid learning to kick a soccer ball into the goal is a long
way from the ultimate RoboCup objective, it is designed to illustrate the ability of
HEXQ to automatically decompose and solve a larger MDP.

Figure 6.21: A simulated stylised bipedal agent showing its various stances.
The agent in this case is a simulated stick figure humanoid robot as illustrated
in figure 6.21. Each foot of the agent is assumed to be able to move independently
to any one of 4 positions shown on the bottom left in the figure. This makes a total
of 16 possible states representing the stance of the two legs shown on the right of
the figure. Sixteen primitive actions are defined for the robot to assume any stance.
The soccer field is discretised into hexagonal cells. In order for the humanoid to
move over a cell and to another cell, it must assume precisely the series of stances
at four positions along each cell as indicated in the left of figure 6.22. The possible
cell positions and robot directions are indicated on the right. The series of stances
is meant to represent a cycle in a walking gait. As the humanoid is free to move its
legs to any stance at each of the four positions across the cell the state space has
now grown to 16 × 4 = 64 possible states.

Figure 6.22: The four stances (left) that comprise a successful traversal of a hexagonal cell (right). Each of the six directions has 4 associated positions across the cell. One set is illustrated.
Six primitive turning actions allow the humanoid to change the direction in which
it is facing. A turning action is only possible when changing from leg stance (ii) to
leg stance (iii) in figure 6.22. It can attempt to change direction in any cell position.
Changing direction is only effective when executed by the robot in an inner position
on the hexagonal cell and when facing towards the middle of the cell. An effective
turn will land the humanoid in position 3 facing in the direction dictated by the
action. Effective turning also needs to be discovered by the agent. With the six
directions added to the state description of the stance, the size of the state space
is 64× 6 = 384. The number of primitive actions has increased to 16 + 6 − 1 = 21
(one stance change action is performed simultaneously with the turning actions).
The humanoid agent does not have a model of the effect of the 21 actions when
in any of the 384 stances6. These must be discovered. Only the foot positions, the
direction the robot is facing and the cell positions are represented. The position
and dynamics of the rest of the body of the robot, including the arms, can be animated
in synchronisation with the feet for visual effect, but do not play a role in the
humanoid's behaviour.

6The walking model of this simulation is based on a similar idea in Uther (2002).

Figure 6.23: The stylized soccer field illustrating the stochastic nature of the soccer ball.
The soccer field is modelled by 20 × 30 = 600 hexagonal cells with each goalmouth
4 hexagonal cells wide as shown in figure 6.23. The ball is assumed to be located on
a hexagonal cell. It is “kicked” when the robot leaves the cell on which the ball is
located. The ball is kicked in one of the six directions that the robot is facing and
lands randomly between three and five cells forward and one cell to either side as
shown in figure 6.23 at the next time step. If the soccer ball is kicked outside the
field it is replaced at the edge of the field where it crossed the boundary.
A 3-dimensional MDP is defined with the variables: robot stance (384 values),
ball location (600 values) and robot location relative to the ball (2501 values).
The robot location relative to the ball is represented as a cell position relative
to the ball as shown in figure 6.24. This is a deictic variable (Ballard et al., 1996)
and defines a relative coordinate frame avoiding the unwanted variance that would
otherwise be introduced by describing the robot's position on the field. Its values are
allowed to range from −30 to +30 horizontally and from −20 to +20 vertically.
While the ball is restricted to stay within the soccer field boundary the robot can
run outside the boundary to approach the ball. Both the robot and the ball are
initially (and after each trial) placed independently and randomly anywhere on the
soccer field. The reward is −1 for each primitive action. The trial terminates when
the robot has kicked the ball into the blue goal (right hand side).

Figure 6.24: The deictic representation of the location of the ball relative to the agent.
The total state space size is therefore 384 × 600 × 2501 = 576,230,400. Together
with the 21 primitive actions, a flat reinforcement learner would require a Q table
size of 576,230,400 × 21, or over 12 billion table entries.
HEXQ orders the variables using an initial 10,000 random actions. Not surpris-
ingly the robot stance variable changes most frequently, followed by its ball relative
position. The ball location on the field is the least changing variable as the ball only
changes position on the field once kicked.

Figure 6.25: The HEXQ graph for the ball kicking domain. Level 1 regions learn to “walk”, level 2 regions learn to kick the ball and the top level learns to kick goals.
The HEXQ graph generated to solve this problem is illustrated in figure 6.25. At
level 1 in the HEXQ hierarchy one MER is found with 6 exits. The exits occur when
the robot changes hexagonal cell or the ball moves. There are 6 exits, one for each of
the possible directions that the robot is facing when on the last position on the cell.
The MER at level 1 is a state space in which the robot can change its leg positions
and direction. It is able to reach any adjacent cell with the correct sequence of
primitive moves. The six sub-MDPs generated learn the optimal policy to move
the legs and change direction to allow the robot to choose to move to any of the 6
adjacent cells. An abstract action at level 2 is therefore seen to be a movement of
the legs through a cycle, moving the robot along the ground and possibly changing
direction in the process. It has the appearance and effect of an animated cartoon.
At level 2, HEXQ also finds one MER with six exits. This time the exits indicate
that the ball location has changed. There are six exits because each of the six
abstract actions can “kick” the ball in different directions. The policies at level 2
translate to abstract actions at level three that have the effect of approaching the
ball and kicking it in any one of six possible directions. The top level sub-MDP
finally learns to kick the ball into the blue goal.

Table 6.4: The number of E action-value table entries required to represent the decomposed value function for the soccer player compared to the theoretical number of Q values required for a flat learner.

Level        States        Actions   Exits   sub-MDPs   E/Q values
FLAT         576,230,400   21        -       -          12,100,838,400
1            384           21        6       6          48,384
2            2,501         6         6       6          90,036
3            600           6         -       -          3,600
Total HEXQ                                              142,020
The number of table entries to store the decomposed action-value function E is
calculated in table 6.4. The saving in storage requirements is close to 5 orders of
magnitude. With the exploration count set to 1 for each transition on the first two levels
of the hierarchy, HEXQ will decompose and solve this problem in less than a minute
on a 400MHz Pentium desktop computer. The straightforward flat formulation of
this problem is intractable and no comparative results can be given. By corollary
4.8 from section 4.3.1 the HEXQ solution is globally optimal.
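The entries of table 6.4 can be reproduced with the following illustrative arithmetic, using the level sizes quoted in the table.

```python
# Reproducing the table 6.4 storage counts: E values per level for HEXQ
# versus the theoretical Q table of the flat learner.
levels = [
    # (states, actions, sub-MDPs) per level of the HEXQ hierarchy
    (384,  21, 6),   # level 1: stance/direction states, primitive actions
    (2501,  6, 6),   # level 2: ball-relative positions, abstract actions
    (600,   6, 1),   # level 3: ball locations, abstract actions, one sub-MDP
]
hexq_total = sum(s * a * m for s, a, m in levels)
flat_total = 384 * 600 * 2501 * 21
print(hexq_total)                # 142,020 E values
print(flat_total)                # 12,100,838,400 Q values
print(flat_total / hexq_total)   # roughly 5 orders of magnitude saving
```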
6.5 Conclusions
This chapter has empirically evaluated the HEXQ algorithm on a variety of deter-
ministic and stochastic shortest path problems to test its automatic decomposition
and solution properties. The experiments have shown that HEXQ:
• may find Markov equivalent sub-spaces in the global problem. The amount
of exploration required to discover all exits is currently tuned manually. A
solution to this issue is further discussed in Chapter 9.
• may decompose some stochastic shortest path problems into a hierarchical
structure of smaller sub-MDPs and solve the overall problem.
• constructs multiple levels of hierarchy.
• decreases the computational complexity in both storage and learning time
from exponential to linear in the number of variables in some cases. Once
solved, the execution time for HEXQ increases exponentially with the number
of variables. Solutions to this problem are discussed in Chapter 8.
• is robust against a forced change in variable ordering and a different number of
variables to describe the same problem.
• will reconstruct the hierarchy to ensure the solution remains hierarchically
optimal if the problem characteristics are changed.
• will find a globally optimal solution for deterministic transitions or when tran-
sitions are stochastic at the top level only.
• avoids the hierarchical credit assignment problem as rewards on exit are mod-
elled at higher levels where they can be explained. All regions are Markov,
ensuring no reward surprises.
• will decompose problems when different state transition and reward function
behaviour is found for different higher level variable values. Exits need to be
detected when higher level variables do not change value. For stochastic problems
the temporally close sampling heuristics currently employed are found to be
sensitive to the stochastic nature of the environment. A solution is proposed
and left for future implementation.
The results presented in this Chapter are consistent with the HEXQ decompo-
sition theory and the requirements of the HEXQ algorithm. The automatic decom-
position and scaling potential of HEXQ have been successfully demonstrated.
Chapter 7
Decomposing Infinite Horizon
MDPs
The thesis to this point has considered the automatic decomposition of stochastic
shortest path problems. Stochastic shortest path problems are episodic, meaning
that the task will eventually terminate. This chapter will extend the HEXQ auto-
matic decomposition approach to infinite horizon Markov decision problems. These
are MDPs for which an optimal policy may require the agent to continue acting
indefinitely.
Infinite horizon control policies are quite common. Examples include controlling
a car to stay in a lane and pole balancing. It is easy to construct an infinite horizon
version of the taxi task introduced in Chapter 6. An infinite horizon specification
could simply generate a new passenger and destination after each passenger drop-off
and require the taxi to continue to deliver new passengers forever.
Infinite horizon MDPs present three issues for HEXQ:
• retaining state abstraction with a discounted value function,
• guaranteeing the termination of sub-MDPs and
• allowing for the possibility that an optimal policy may require the agent to
continue execution in a lower level sub-MDP.
The first issue is caused by discounting. Recall that the value function of a state
is the expected sum of discounted future reward. A reward N steps in the future is
discounted by γ^{N−1}, where γ is a constant discount factor less than one. Discounting
is necessary to ensure that the value function remains bounded. Unfortunately, the
amount of discount applied to rewards after exiting a sub-MDP now depends on
how many steps it takes to reach the exit. In figure 7.1 the discount after exiting is
smaller from the perspective of state s1 than for state s2 because of the extra time
steps required to reach the exit.
Figure 7.1: For a discounted value function, the amount of discount applied after exiting the sub-MDP depends on the number of steps required to reach the exit.
One solution is to store separate exit values for each state of a sub-MDP. This
gives up state abstraction of the exit values and with it the scaling potential for
HEXQ1. State abstraction has been shown to be important for scaling in reinforce-
ment learning by other researchers (Dayan and Hinton, 1992, Kaelbling, 1993, Di-
etterich, 2000, Andre and Russell, 2002). The consequences of discounting in hier-
archical reinforcement learning with a single decomposed value function present a
1Dietterich also discusses this issue under result distribution irrelevance (Dietterich, 2000), concluding that abstractions of this kind are only possible in an undiscounted setting [in MAXQ].
major impediment to scaling.
This chapter will show how this issue can be overcome with the introduction
of an additional decomposed discount function. This discount function, working
in concert with separate sub-MDP discounted value functions, allows safe state
abstraction in the face of discounting. The discount function stores the amount
of discount required on region exit for each state in the region. This approach
is a natural extension to the automatic decompositions produced by HEXQ and
can be applied to other hierarchical reinforcement learning methods with additional
constraints.
The second issue is to ensure the termination of sub-MDPs to generate proper
policies for exiting the region. The issue arises because HEXQ expects a sub-MDP
with an exit to find a policy that will actually use this exit. However, discounting
or the possibility of positive rewards may result in an optimum policy that does not
exit the sub-MDP. One solution is to artificially inflate the termination value at the
exit of a sub-MDP to overcome any reluctance to exit. For stochastic shortest path
problems this issue does not arise as the discount factor is one, all rewards inside
the sub-MDP are negative and the termination value is zero. For these problems
exiting is assured. One solution for infinite horizon problems is to define a large
positive pseudo exit value to overcome any positive rewards inside the sub-MDP or
effects due to discounting and thereby force an exit.
The final issue concerns the possibility that an optimum policy contains a non-
terminating abstract action. For stochastic shortest path problems the task always
terminates and all sub-MDPs policies are proper, meaning that they will eventually
exit. For infinite horizon problems it is possible that the best policy is for an
agent not to exit a sub-MDP but to continue to execute it forever. A hierarchical
reinforcement learning algorithm must create and allow an agent to explore such
continuing abstract actions.
As an example of a problem with a non-terminating abstract action policy, con-
sider the ball kicking agent from section 6.4, but this time with rewards constructed
to make it more profitable for the agent to run around the ball rather than kick
goals. Running around the ball is achieved by continuing in the second level region
that models the agent’s relation to the ball and ignores the ball’s location on the
field. This presents an execution dilemma. If the agent continues to run around the
ball it is prevented from exploring policies that involve kicking the ball.
The solution presented in this Chapter is for HEXQ to create non-terminating
sub-MDPs for each Markov equivalent region. These sub-MDPs are artificially
timed-out during learning, allowing higher level solutions to be explored. By allow-
ing non-terminating abstract actions, a recursively optimal solution can be found
for HEXQ decomposable infinite horizon MDPs, even when the solution requires
continuing in a sub-task.
Infinite horizon modifications to both the taxi domain and the ball kicker task
will be used to illustrate these extensions to HEXQ.
7.1 The Decomposed Discount Function
The aim of this section is to derive decomposition equations for a discounted value
function that retain the property of state abstraction. These equations are needed to
allow HEXQ to state abstract infinite horizon MDPs. This section retraces some of
the derivation steps for the HEXQ decomposition equations for stochastic shortest
path problems in Chapter 4, but this time with a discounted value function.
From Chapter 2 the value function for state s, as the expected sum of future
discounted rewards, is
V πm(s) = E{r1 + γr2 + γ2r3 . . .} (7.1)
where π is a stationary policy, m is a HEXQ sub-MDP, r_t are the primitive rewards
received at each time step and γ is the discount factor.
In general sub-MDP m uses abstract actions to invoke lower level sub-MDPs
based on policy π. Allowing random variable N to represent the number of steps
required to exit the next lower level sub-MDP, ma, equation 7.1 can be written as
the sum of two series
V^\pi_m(s) = E\Big\{\sum_{n=1}^{N-1} \gamma^{n-1} r_n\Big\} + E\Big\{\sum_{n=N}^{\infty} \gamma^{n-1} r_n\Big\}. \qquad (7.2)
The first series is the local discounted value function for sub-MDP ma where the
termination is defined by the zero reward exit to an absorbing state. Isolating the
reward on exit in the second series gives
V^\pi_m(s) = V^\pi_{m_a}(s) + E\Big\{\gamma^{N-1} r_N + \sum_{n=N+1}^{\infty} \gamma^{n-1} r_n\Big\}. \qquad (7.3)
If s′ is the state reached after exiting the sub-MDP ma, R the expected reward
on exit to s′ after N steps, and defining P^π(s′, N | s, π(s)) as the joint probability of
reaching state s′ in N steps starting in state s and following policy π, then equation
7.3 becomes
V^\pi_m(s) = V^\pi_{m_a}(s) + \sum_{s', N} P^\pi(s', N \mid s, \pi(s))\, \gamma^{N-1} \big[R + \gamma V^\pi_m(s')\big] \qquad (7.4)
Importantly, for a HEXQ partition, the termination state s′ reached and the
expected reward R on exit are independent of the number of steps, N , to reach the
exit. This is the case because (1) there is only one exit state defining each sub-MDP,
(2) that state is reachable with probability one, and (3) terminating the sub-MDP
is constrained via this one exit state. This is the key property that makes it possible
to abstract a whole region into an aggregate state2. Independence means that:
2Multi-time models for options (Precup, 2000) and similar approaches for MAXQ (Dietterich, 2000), HAMs (Parr, 1998) and ALisp (Andre and Russell, 2002) are used to apply the correct discount on the termination of abstract actions. The important difference here is the guarantee of state abstraction in addition to the correct discounting for abstract actions.
P^\pi(s', N \mid s, a) = P^\pi(s' \mid s, a) \cdot P^\pi(N \mid s, a) \qquad (7.5)
Equation 7.4 using abbreviation a = π(s) becomes
V^\pi_m(s) = V^\pi_{m_a}(s) + \sum_{N=1}^{\infty} P^\pi(N \mid s, a)\, \gamma^{N-1} \cdot \sum_{s'} P^\pi(s' \mid g, a) \big[R + \gamma V^\pi_m(s')\big].
Definition 7.1 (Discount Function) The discount function D^π_m is the expected
discount to be applied to the exit value of m, under policy π, where N represents the
random number of steps to exit.
D^\pi_m(s) = \sum_{N=1}^{\infty} P^\pi(N \mid s, \pi(s))\, \gamma^{N-1} \qquad (7.6)
The discounted version of the exit action value function E (from Chapter 4) is
defined as follows:
Definition 7.2 (E Function) The HEXQ action-value function E (or exit value
function) for all states s in region g is the expected value of future discounted rewards
after completing the execution of the abstract action a exiting g and following the
policy π thereafter. E includes the expected primitive reward on exit, but does not
include any rewards or discounting accumulated inside g.
E^\pi_m(g, a) = \sum_{s'} P^\pi(s' \mid g, a) \big[R + \gamma V^\pi_m(s')\big] \qquad (7.7)
where P^π(s′ | g, a) is the probability of transitioning to state s′ after abstract action
a terminates from any state s ∈ g and R is the expected final primitive reward on
transition to state s′ when abstract action a terminates.
Using these definitions, equation 7.4 can now be written succinctly as:
V^\pi_m(s) = V^\pi_{m_a}(s) + D^\pi_{m_a}(s) \cdot E^\pi_m(g, a) \qquad (7.8)
The discount function D is itself recursively represented. Let N = N1 + N2, where
N1 is the random number of steps to exit sub-MDP m_a, reaching state s′, and N2 is
the balance of steps to exit sub-MDP m. N2 and s′ are independent of N1 and s.
From equation 7.6
D^\pi_m(s) = \sum_{N_1, N_2} P^\pi(N_1 \mid s, a)\, P^\pi(N_2 \mid s, a)\, \gamma^{N_1 + N_2 - 1}
          = \gamma \sum_{N_1} P^\pi(N_1 \mid s, a)\, \gamma^{N_1 - 1} \cdot \sum_{N_2} P^\pi(N_2 \mid s, a)\, \gamma^{N_2 - 1}
          = \gamma\, D^\pi_{m_a}(s)\, \Gamma^\pi_m(g, a) \qquad (7.9)
where
Definition 7.3 (Γ Function) The action-discount function Γ for all states s in
region g is the expected value of discount that is applied to state s′ reached after
completing the execution of the abstract action a exiting g and following the policy
π thereafter. Γ assumes that the value of the state following exit of sub-MDP m is
1 and all rewards leading up to exit are zero.
\Gamma^\pi_m(g, a) = \sum_{s'} P^\pi(s' \mid g, a)\, D^\pi_m(s') \qquad (7.10)
where P^π(s′ | g, a) is the probability of transitioning to state s′ after abstract action a
terminates from any state s ∈ g.
Equations 7.7, 7.8, 7.9 and 7.10 are the HEXQ decomposition equations for a
discounted value function following policy π. The optimum value functions, where
abstract action a invokes sub-MDP ma and state s is in region g, are
V^*_m(s) = \max_a \big[V^*_{m_a}(s) + D^*_{m_a}(s) \cdot E^*_m(g, a)\big] \qquad (7.11)

D^*_m(s) = \gamma \max_a \big[D^*_{m_a}(s) \cdot \Gamma^*_m(g, a)\big] \qquad (7.12)
This formulation requires only one value to be stored for functions E and Γ for
all states in sub-MDPs for region g. Safe abstraction of the sub-MDP’s states can
be retained as in the non-discounted case. This is achieved at the cost of storing
the separate on-policy action discount function Γ. The benefit is that decomposed
discounted value functions can, in the best case, scale linearly in space complexity
in the number of variables using HEXQ versus an exponential increase for the flat
problem. From equation 7.8 or 7.11 it can be seen that the top sub-MDP does not
need a discount function. The storage requirements increase to less than twice
that required for an undiscounted episodic problem.
Figure 7.2 is a simple example of how an overall value function is composed for
an MDP with discounting. Figure 7.2 (a) shows a 2-dimensional (x, y) state MDP
with two identical regions with one exit to the right. The deterministic actions are
to move left and right and all rewards are -1 per step. The reward on termination
is 20. The discount factor is 0.9. The key point is that the composed value function
in 7.2 (b) exactly represents the original MDP value function by only storing the
exit value E for the two y variable labels at the top level sub-MDP (i.e. values 7.71
and 20.0).
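The example of figure 7.2 is small enough to reproduce directly. The sketch below assumes the optimal "move right" policy and recomputes the local value function, the discount function and the composed values; the function names are illustrative and are not part of the HEXQ implementation.

```python
# Reproducing the figure 7.2 example: two identical 5-cell regions, deterministic
# left/right actions, -1 per step, +20 on final termination, gamma = 0.9.
GAMMA = 0.9

def local_value(x):
    """Discounted sum of -1 rewards until exiting the region (exit reward excluded)."""
    steps_to_exit = 4 - x                       # the exit lies to the right of x = 4
    return sum(-1 * GAMMA ** n for n in range(steps_to_exit))

def local_discount(x):
    """Expected discount applied to the exit value: gamma^(N-1) with N steps to exit."""
    return GAMMA ** (4 - x)

# Exit values at the top level: leaving y = 1 terminates with reward 20;
# leaving y = 0 enters y = 1 at x = 0 with reward -1.
E_y1 = 20.0
E_y0 = -1.0 + GAMMA * (local_value(0) + local_discount(0) * E_y1)   # = 7.71

for y, E in ((0, E_y0), (1, E_y1)):
    print([round(local_value(x) + local_discount(x) * E, 2) for x in range(5)])
# y = 0: [1.62, 2.91, 4.35, 5.94, 7.71]
# y = 1: [9.68, 11.87, 14.3, 17.0, 20.0]  (the figure rounds 11.87 to 11.9)
```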
Figure 7.2: A simple example showing the state abstracted decomposition of a discounted value function. (a) shows a 2-dimensional MDP with two identical regions with one exit to the right. The deterministic actions are move left and right, all rewards are -1. The reward on termination is 20. The discount factor is 0.9. (b) the composed value function for each state. (c) and (d) are the abstracted sub-MDP value and discount functions respectively.

A similar technique can be used with MAXQ. Termination of MAXQ sub-tasks
allows multiple exit states and for a similar treatment to work it is additionally
required that either all MAXQ sub-tasks have single exit states or that all states s′
reached after termination have the same exit value for any instance of the sub-task.

As each level of the HEXQ hierarchy is automatically constructed, the stationary
policy found is the recursively optimal policy for each sub-MDP. In section 4.3 it
was proven that the recursively optimal solution of a HEXQ decomposition is also
hierarchically optimal. This is no longer the case for a discounted value function
because the optimum value function and policy can be affected by the exit value of
a sub-MDP. Hierarchical execution therefore means a recursively optimal solution
for HEXQ.
In practice HEXQ determines the action discount function Γ by using dynamic
programming policy iteration in the same way that the local value function is cal-
culated during hierarchy construction. Function Execute (table 5.5 in Chapter 5) is
used to additionally update the discount function by performing on-policy updates
synchronously with value function updates.
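A simplified sketch of such synchronous on-policy updates for a bottom-level sub-MDP (primitive actions only) is shown below. The learning rate, table layout and is_exit flag are illustrative assumptions, and this temporal difference form is only an approximation of the dynamic programming and Execute procedures referred to above.

```python
# While following the sub-MDP's policy, the local value estimate V and the
# discount estimate D are updated together from each primitive transition.
def update_level1(V, D, s, r, s_next, is_exit, alpha=0.1, gamma=0.9):
    if is_exit:
        # The exit reward is modelled one level up, so the local target value is 0,
        # and exactly gamma^0 = 1 of discount remains to apply to the exit value.
        v_target, d_target = 0.0, 1.0
    else:
        v_target = r + gamma * V[s_next]      # local discounted value target
        d_target = gamma * D[s_next]          # one more step of discount to the exit
    V[s] += alpha * (v_target - V[s])
    D[s] += alpha * (d_target - D[s])
```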
7.2 Infinite Horizon MDPs
This section explains how HEXQ is modified to solve infinite horizon problems.
It has been assumed that all sub-MDPs will terminate with probability one. This
is no longer the case because of discounting or the possibility of positive rewards.
To allow the HEXQ decomposition to work with all finite MDPs, termination of
sub-MDPs in the presence of positive and negative rewards needs to be guaranteed.
Even when all the rewards are negative, an optimal policy does not guarantee that
a sub-MDP will exit when using a discounted value function.
A pseudo exit function Ē that has a large enough positive termination value to
force any sub-MDP to exit is used. The pseudo exit function Ē is defined in the
same way as the exit function E, except that instead of a zero reward on exit, a
large enough positive reward is used to force the agent to exit the region3.
In summary there are three exit functions now for each sub-MDP in HEXQ. The
pseudo reward exit value function Ē determines the policies available as abstract
actions at the next level up in the hierarchy. The function Γ holds discount values
to be applied to the real exit value function at the next level up. The real exit
value function E holds the (in Dietterich's words) "uncontaminated" exit values
for each sub-MDP. Ē, E and Γ are determined using dynamic programming during
hierarchy construction from the discovered model as for stochastic shortest path
problems. Subsequently Ē is updated on-line using temporal difference methods
and Γ and E are simultaneously updated following Ē's policy in function Execute.
For infinite horizon problems it is of course possible that a recursively optimal
policy means that the agent may not wish to exit a sub-task at any level of the hier-
archy. Basic HEXQ does not provide for this option as every sub-MDP is designed
to exit. It would be a simple extension to allow the top sub-MDP to be continuing
(infinite horizon), but what about all others? Solutions that continue in a sub-MDP
below the top level are accommodated with the inclusion of a non-terminating policy
in each sub-task policy cache. The stay-in-region policy inclusion is also suggested
by Hauskrecht et al. (1998) for one of their methods for constructing a macro set
(policy cache) over regions.

3A similar trick is used in MAXQ. In MAXQ the pseudo reward value function sets large negative values on undesirable terminations. These are not required in HEXQ because the HEXQ algorithm ensures, by construction, that unwanted exits cannot occur.
Recall that one sub-MDP is generated for each exit state in step 7 of the HEXQ
algorithm in table 5.1. Exiting is assured by the pseudo reward forcing function
which uses the incentive of a large positive reward on exit. To allow for the possibility
of a recursively optimal policy continuing in a sub-MDP, an additional sub-MDP is
created without an absorbing state for each region in the HEXQ partition. The stay-
in-region policy is induced by disallowing all exits for the sub-MDP. This creates
an abstract action at the next level and an extra policy in the policy cache that
continues in the sub-MDP forever.
To ensure that such a policy exists it is necessary to ensure that regions are
not forced to exit. The strongly connected components found by HEXQ have just
this property as all states can always reach each other. In the event of single state
regions, a no-exit policy is only generated if that state can transition to itself without
exiting4.
During learning, a time-out is now required for sub-MDPs. A non-exiting sub-
MDP will not return control to its parent and a means of interrupting the execution
of the sub-MDP is required. The number of steps that a sub-MDP executes is
counted and if it exceeds a threshold value the execution is interrupted and control
is passed back up the hierarchy to the top level without updating exit value functions.
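A minimal sketch of this time-out mechanism, with hypothetical helper names, is given below.

```python
# Execute a (possibly non-exiting) sub-MDP policy under a step budget. If the
# budget is exhausted before an exit occurs, control is returned without
# updating the exit value functions.
def execute_sub_mdp(env, state, policy, budget=500):
    for _ in range(budget):
        action = policy(state)
        state, _reward, exited = env.step(action)
        if exited:
            return state, True       # normal exit: caller may update exit values
    return state, False              # timed out: skip exit-value updates
```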
7.3 Infinite Horizon Experiments
Modified versions of the taxi task from section 6.1 and the robot soccer player from
section 6.4 are used to provide empirical evidence that HEXQ is able to perform in
an infinite horizon setting and continue in a sub-task if required.
4Regions are not tested for potential combination as in section 5.2.2. The algorithm could be improved by allowing regions to be combined if an exit is not forced on entry to the combined region. In the current implementation each strongly connected component is conservatively taken to be a separate region.
7.3.1 Continuing Taxi
The taxi domain is modified by making the task continuing. The taxi is additionally
given the option of performing a local job that only requires the use of the navigation
sub-task. The objective is to compare the solutions found by HEXQ to the optimal
behaviour found by a flat reinforcement learner for various reward scenarios.
The taxi task is reproduced in figure 7.3. A taxi trans-navigates a grid world to
pick up and drop off a passenger at four possible locations, designated R, G, Y and
B. Actions move the taxi one square to the North, South, East or West, pickup or
putdown the passenger. Navigation actions have a 20% chance of slipping to either
the right or left of the intended direction. Generally all actions incur a reward of
−1. If the taxi executes a pickup action at a location without the passenger or a
putdown action at the wrong destination it receives a reward of -10. To make the
task continuing, another passenger location and destination are created at random
after the passenger is delivered and the taxi is mysteriously tele-ported to a random
grid location. The reward for delivering the passenger is 200.
The taxi domain is augmented with another source of income for the taxi. If
the taxi transitions from below to the grid location marked with the $ sign in figure
7.3, the reward is a positive number. This may be interpreted as a local delivery
run with guaranteed work but at a different rate of pay. The taxi problem can be
modelled as an MDP with a 2-dimensional state vector s = (taxi location, passenger-
source/destination). There are 25 possible taxi locations and 20 possible passenger-
source × destination combinations.
With all three value functions, pseudo, real and discount, HEXQ requires 25
states × 5 sub-MDPs × 4 primitive actions × 3 value functions = 1500 table entries
at level 1. At the top level there is one sub-MDP (semi-MDP) with 20 states × 9 abstract actions × 1 real value function (the others are redundant) = 180 table
entries, a total requirement of 1680 table entries compared to 3000 for the flat
learner. These storage requirements assume actions with no effect are eliminated
(section 5.7). The pseudo value function is not required for execution of a learnt
policy and can be discarded after learning, reducing the storage requirement for the
decomposed value function to 1180 table entries. For larger problems the savings
can become more significant.
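These storage counts can be reproduced with the following illustrative arithmetic.

```python
# Storage for the continuing taxi: the decomposed tables versus the flat Q table.
level1 = 25 * 5 * 4 * 3        # states x sub-MDPs x primitive actions x {pseudo, real, discount}
top    = 20 * 9                # abstract states x abstract actions, one real value function
print(level1 + top)            # 1680 entries during learning
print(25 * 5 * 4 * 2 + top)    # 1180 once the pseudo exit function is discarded
print(25 * 20 * 6)             # 3000 Q values for the flat learner
```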
Figure 7.3: The infinite horizon taxi task. The graph shows that HEXQ finds and switches policies similarly to that of a flat reinforcement learner for various values of positive reward at $. As the reward at $ increases the taxi stops delivering passengers and instead visits the $ location as frequently as possible. This provides confirming evidence for the correct operation of HEXQ for an infinite horizon problem, even when the optimum solution requires continuing in the non-exit navigation sub-MDP.
This problem has two optimal solutions depending on the value of the $ reward.
If the $ reward is low, the solution is to continue to pick up and deliver passengers as
per the original taxi problem. For larger $ values the taxi prefers to continue local
delivery runs and ignore the passenger. As the local $ reward is the same in the
context of any passenger pickup and destination location, HEXQ discovers one class
of region for navigation. Performing local deliveries means deciding to continue in
a navigation sub-task.
It is instructive to vary the local reward at location $ to see when the taxi decides
to perform local deliveries instead of picking up and dropping off passengers. The
local reward is varied from −1 to 19 in increments of 1 and the MDP solved both
with HEXQ and with a flat RL. The number of times the $ location is visited
per time step is counted as an indication of which strategy the taxi uses. A high
visitation rate indicates the taxi prefers the local delivery run.
A HEXQ automatic decomposition of the original episodic taxi task creates 4 sub-
MDPs at level 1 (see section 6.1.1). For the continuing taxi task we still only have
4 exit states at level 1, but an extra non-exiting sub-MDP is created as explained
previously. Figure 7.3 indicates the optimal policy chosen for each value of local
reward $. As the amount of local reward increases the taxi switches strategy from
delivering passengers to local delivery runs. The switch takes place when the local
reward is about 10 for both HEXQ and the flat learner. The error bars indicate the
maxima and minima over ten runs for each reward setting.
Within error bounds, both learners find similar solutions for this problem, pro-
viding confirming evidence that HEXQ can find the correct solution by persisting
in a sub-task if necessary.
7.3.2 Ball Kicking
The second example is the stylized bipedal robot introduced in section 6.4, that
learns to walk, kick a ball and score goals. This problem would be intractable with
a flat learner on present day desktop computers requiring nearly 3 billion Q values.
The experiment is designed to demonstrate the benefit of state abstraction in a
discounted setting.
The MDP state description contains three variables: the robot leg stance and
direction (384 values), the position of the robot relative to the ball (861 values) and
the position of the ball on a soccer field (400 values). 21 primitive actions allow
the robot to move the legs and change its direction. The robot leg positions, soccer
field and ball behaviour is shown in figure 7.4 (a) and (b). The reward on each state
transition is −1 and a trial terminates when a goal is scored. When the robot runs
into the ball, the ball is kicked stochastically but roughly in the direction the robot
is moving. HEXQ automatically generates a 3 level hierarchy. A similar problem
is described more fully in section 6.4. The primary interest is to test HEXQ when
positive reward is introduced. In separate experiments, positive reward is introduced
to
1. reward the robot for running on the spot and
2. reward the robot for running around the ball.
The significance of these conditions is that recursively optimal solutions can only be
found by the robot continuing in a level 1 sub-task and level 2 sub-task respectively.
Figure 7.4: The soccer player showing (a) the simulated robot leg positions, (b) 400 discrete ball locations on the field, (c) the discounted value of states in the level 2 no-exit sub-MDP when the robot is rewarded for running around the ball and (d) a snapshot of the robot running around the ball.
Using a learning rate of 0.25, a discount rate of 0.99 and an ε-greedy exploration
policy for the top level sub-MDP with ε = .5 for the first 500 trials, HEXQ generates
a total of 17 sub-MDPs to decompose the problem. At level 1, 6 sub-MDPs deter-
mine the one step walk directions plus one no-exit sub-MDP. At level 2, unlike in
section 6.4, there are two regions, one with 860 states and the other with one state
when the agent is at the ball location. The first region has 7 sub-MDPs representing
the 6 directions from which to approach the ball and one no-exit MDP. The single
state region has two vestigial sub-MDPs, one allowing abstract actions to kick in
one of six directions and a no-exit policy. There are two regions as a direct result
of conservatively taking strongly connected components as regions. They would be
merged into one region with the suggested improvement in the footnote of section
7.2.
After exploring the value function for scoring goals, the agent decides to continue
in the no-exit level 1 sub-MDP running on the spot for case (1). For case (2) the
agent continues in the level 2 no-exit sub-MDP for the larger region, running around
the ball as shown in figure 7.4 (d). In this case it invokes terminating sub-MDPs
from level 1 for locomotion. Figure 7.4 (c) shows the discounted value function for
a portion of the no-exit sub-MDP for the larger region at level 2. Six maximum
value states determine the circling policy. For this sub-MDP policy the sequence of
primitive rewards is 4, -1, -1, -1, 4, -1, -1, -1, 4, -1, .... Executing its stylised walking
gait, the agent makes 4 leg movements to take one step. The reward is 4 when it
changes position relative to the ball and -1 otherwise. With a discount factor of .99
the E value of each state in the final cycle is the discounted sum of the above reward
sequence and is equal to 26.8 truncated to one decimal place.
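As a sanity check on this figure, the cyclic return can be computed directly. The short Python sketch below (purely illustrative, not part of HEXQ) sums the infinitely repeating reward cycle 4, -1, -1, -1 under a discount factor of 0.99:

    gamma = 0.99
    cycle = [4, -1, -1, -1]
    # Value of the infinite repetition of the cycle:
    #   V = sum_i gamma^i * r_i + gamma^len(cycle) * V
    numerator = sum(gamma**i * r for i, r in enumerate(cycle))
    value = numerator / (1 - gamma**len(cycle))
    print(value)  # approximately 26.89, i.e. 26.8 truncated to one decimal place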
HEXQ solves this problem in about a minute, on a 400MHz Intel desktop com-
puter, with theoretically over four orders of magnitude saving in space requirements
as a result of state abstraction made possible with the additional action discount
function. The results support the expected operation of the HEXQ algorithm to au-
tomatically decompose and solve an infinite horizon MDP with positive and negative
rewards.
7.4 Conclusion
The hierarchical reinforcement learners MAXQ and HEXQ employ a hierarchically
decomposed value function. When the value function is represented by the dis-
counted sum of future rewards it is not possible to directly abstract sub-task state
values because the amount of discount to apply to the completion or exit values
is not known. The introduction of a separate on-policy discount function for each
sub-task has solved this problem for HEXQ and conditionally for MAXQ. The in-
troduction of non-exiting sub-MDPs that are interrupted during learning and the
extension of on-policy pseudo reward functions with large positive rewards on exit
has allowed HEXQ to automatically decompose and solve infinite horizon MDPs
with positive rewards.
The discount function gives the amount of discount to apply at termination or
exit for each state in the sub-task. In HEXQ it is possible to safely abstract the exit
values for all states of a sub-MDP because exits are unique.
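In sketch form only (using the notation of equation 8.2 in the next chapter, and glossing over details developed earlier in this chapter), if $D^*_{m^a}(s)$ denotes the on-policy discount accumulated by sub-MDP $m^a$ from state $s$ until exit, the discounted decomposed value function is roughly

\[ V^*_e(s) = \max_a \left[ V^*_{m^a}(s) + D^*_{m^a}(s)\, E^*_e(g^e, a) \right], \]

so that the exit value is discounted by exactly the amount of discount incurred while completing the sub-task.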
A decomposed discounted value function and state abstraction extends the effi-
cient application of this type of hierarchical reinforcement learning from stochastic
shortest path problems to infinite horizon MDPs.
Chapter 8
Approximations
In many practical situations an approximate solution is all that may be required. Ap-
proximations can often reduce computational complexity significantly. This Chapter
reports on preliminary work using approximations in the application of HEXQ to
problems where some additional background knowledge is assumed. This can lead
to savings in storage and computation time over and above that already achieved
by HEXQ, while retaining the ability to automatically construct a hierarchy. In
some cases, approximations can improve the policy found by HEXQ or even make a
problem HEXQ decomposable when the problem was not decomposable beforehand.
Section 4.4 illustrated the compact representation of a HEXQ tree by a directed
acyclic graph using safe state abstractions. The HEXQ value function for a hier-
archical policy is exactly the same before and after state abstraction. This section
introduces approximations to a decomposed value function that go beyond safe state
abstraction of a hierarchically optimal value function. Two types of approximations
are explored in relation to stochastic shortest path MDPs:
• variable resolution approximations,
• stochastic approximations.
8.1 Variable Resolution Approximations
Not only do humans have the ability to form abstractions, but they also control the amount of detail required to represent complex situations and to make decisions. To decide the best way to travel to a conference, for example, the first component of the decision may involve the available modes of transport: car, bus, aircraft or train.
The final decision may even take into consideration connections at either end for each
mode of primary transport. However, the way to leave the home (front door, back
door or garage door) is unlikely to be an influencing factor in the decision, although
the final execution of the plan will use one of the doors to exit the home. Planning
the whole journey to the lowest level of detail is unnecessary and an approximation
limited to the first two levels may be sufficient. How could a reinforcement learner
model these pragmatics?
The HEXQ decomposition produces a hierarchy of abstract models with increas-
ingly finer resolution at successive levels further down the hierarchy. For example,
in the simple maze from Chapters 1 and 4, the top level models the coarser grained
inter-room navigation, whereas the bottom level represents the higher resolution
intra-room navigation from position to position. In the Towers of Hanoi problem,
higher levels in the hierarchy model the movement of larger discs. The details of
moving the smaller stack of discs, that sits on top of the larger discs, is left to the
lower levels in the hierarchy.
A natural approximation at each level of a hierarchy is to ignore the details at
the lower levels in the hierarchy. The depth in the hierarchy to which the details
are taken into account can be varied, providing a mechanism whereby the accuracy
of the approximation can be controlled. A number of heuristics are presented to ap-
proximate the decomposed value function both at construction time of the hierarchy
and during final execution of the hierarchical policy.
8.1.1 Hierarchies of Abstract Models
The HEXQ decomposition produces a hierarchy of abstract models at progressively
finer resolutions.
Consider a HEXQ decomposition to level $e$ with abstract states $s^e \in S^e$. The Cartesian product of $S^e$ and the rest of the frequency ordered variables $X_{e+1}, \ldots, X_d$ from the original MDP forms a partition $G^e$ over the base level states at level $e$:

\[ G^e = S^e \times X_{e+1} \times X_{e+2} \times \ldots \times X_d \qquad (8.1) \]
Under this interpretation, the HEXQ decomposition produces a series of progres-
sively coarser partitions, Ge, at each successive level. Each Ge is a quotient partition
with respect to the next lower level e− 1. The blocks of partition Ge represent the
states of the abstracted SMDP at level e. The HEXQ decomposition of the original
MDP is hence modelled at progressively finer levels of resolution proceeding from
the top to the bottom level of a HEXQ hierarchy.
The HEXQ decomposition provides a structure and mechanism to approximate
solutions to various levels of detail. Recall that the decomposed hierarchically optimal value function for the abstract SMDP at level $e$ (equation 4.15 from Chapter 4), for state $s$ and $g^e \in G^e$, where (abstract) action $a$ invokes sub-MDP $m^a$, is

\[ V^*_e(s) = \max_a \left[ V^*_{m^a}(s) + E^*_e(g^e, a) \right] \qquad (8.2) \]
Before proceeding to discuss a number of variable resolution approximations, an
example is introduced to facilitate the discussion.
Figure 8.1 is the plan view showing two of 10 similar multi-room floors in a
multi-storey building connected by two (East and West) banks of elevators. The
state of this MDP is described by three variables: floors (values 0-9), rooms (0-24)
and positions in room (0-8). An agent uses one step actions to move North, South,
East, West, up or down and starts on floor 0. The goal is to move to position “G” on
floor 1 and execute the up action (to say turn the coffee maker on). The up/down actions only work at the elevator where they move the agent one floor at a time. The reward is −1 for taking each action.

Figure 8.1: The plan view of two out of ten floors connected via two banks of elevators.
The HEXQ hierarchical decomposition for the multi-storey building MDP aggre-
gates room locations into a single state with 8 abstract actions, 4 to leave the room
via the North, South, East or West and 4 to execute up or down at each elevator or
“G” position. The top level SMDP has only 10 abstract states, one for each floor,
and 9 abstract actions, moving to each elevator and pressing up or down and moving
to the coffee machine and pressing up. The storage required to compute the value
function for the original MDP is 13,500 values (9 positions × 25 rooms × 10 floors
× 6 actions). The HEXQ decomposed model requires 1,306 values (calculated using equation 5.7.2 from Chapter 5), an order of magnitude
saving.
Three variable resolution approximations to reduce the computational complexity beyond that of safe state abstraction are introduced. The work here is of a
preliminary nature and approximations are suggested as a future research topic.
8.1.2 Variable Resolution Value Function
The overall value function is composed from the value functions of sub-MDP com-
ponents at each level in the hierarchy. Even after learning is complete, execution of
a hierarchically optimal policy is not just a matter of using a simple lookup table
as for a flat reinforcement learner. HEXQ uses a best first search2 as a result of
the recursion in equation 8.2 to decide the next best abstract action. These deci-
sions are made when invoking sub-MDPs during learning, and after learning on final
execution of the decomposed problem.
Depth bounded search is a well known technique in problem solvers (Russell and
Norvig, 1995). Limiting the depth of the search in this case can approximate the
value function. The approximation is accurate to the extent that either:
• there is a diminishing return (or loss) in the finer detail of sub-tasks and their
contribution to the overall value function can be ignored or
• sub-tasks accumulate near constant internal reward during execution. In this
case the reward on exit can be made proportional to the accumulated reward
estimate within a sub-task plus exit, with the effect that there is no further
contribution to the value function from lower levels.
For example with a depth limit of zero, representing the most abstract approxi-
mation, equation 8.2 becomes
\[ V^*(g) = \max_a E^*(g, a) \qquad (8.3) \]
reducing nicely to the usual “flat” Q learning formulation for the problem. (For HEXQ, with only one variable in the state description of the original MDP, or for the first level of the hierarchy, the exit value function E is identical to the normal Q action value function in reinforcement learning. This consistency is another motivation for the particular choice of the decomposition in the way reward on exit is treated.)
The depth first search in equation 8.2 has a time complexity of $O(\prod_d b_d)$, where $b_d$ is the branching factor or number of abstract actions at each level $d$, down to the leaves of the HEXQ graph. The time complexity for a depth zero search is $O(\sum_d b_d)$, the sum of the number of abstract actions applicable at each level.
Varying the search depth may give better approximations at the expense of a
greater execution cost. By making the value function search depth variable, the
speed of decision making in HEXQ can be offset against the quality of the policy
during execution, providing an anytime execution procedure for HEXQ.
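A minimal sketch of such a depth-bounded evaluation is given below. It is illustrative only and not the thesis implementation; the interface (a node object exposing abstract_state, abstract_actions, sub_mdp and exit_value) is a hypothetical stand-in for the HEXQ data structures.

    def best_abstract_action(node, state, depth_limit):
        # Evaluate V*(s) = max_a [ V*_{m^a}(s) + E*(g, a) ] as in equation 8.2,
        # but only expand invoked sub-MDPs down to depth_limit further levels.
        best_value, best_action = float("-inf"), None
        g = node.abstract_state(state)              # project the state to this level
        for a in node.abstract_actions(g):
            if depth_limit <= 0:
                inner = 0.0                         # depth zero: ignore sub-task detail
            else:
                child = node.sub_mdp(a)             # sub-MDP m^a invoked by action a
                inner, _ = best_abstract_action(child, state, depth_limit - 1)
            value = inner + node.exit_value(g, a)
            if value > best_value:
                best_value, best_action = value, a
        return best_value, best_action

With depth_limit set to zero the recursion reduces to equation 8.3; increasing the limit trades decision time against the accuracy of the comparison between abstract actions.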
It is important to note that limiting the search to a particular depth does not
affect the ability to operate at more detailed levels. Limited depth resolution op-
erates at each level of the HEXQ hierarchy. Recall that in HEXQ the levels are
numbered from the bottom up. For example, at the highest level, level 4, a depth of
2 searches down to level 2 and at level 3 the search extends down to level 1. HEXQ
is still able to implement fine grain policy because when abstract actions at lower
levels are evaluated, they in turn resolve the local problem to their own set depth
limit.
In the multi-storey building example, limiting the value function to top level
values would result in choosing an arbitrary elevator to reach the first floor. In this
case the agent may lengthen the journey by not travelling to the nearest elevator
bank. Increasing the depth of search to one level may result in a shorter path, as
the distance to each elevator bank is included in the decision. For the Towers of
Hanoi puzzle in section 6.3, constraints ensure that the cost of an abstract action
is constant at each level (given the disks are stacked on one peg to this level). This
means that a depth zero search is sufficient to ensure an optimal policy, reducing
the number of values that need to be searched, at the top level in the 7 disk version,
from $6^7 = 279,936$ to only $6 \times 7$. A saving of more than 3 orders of magnitude! This is
a significant reduction in decision time complexity for choosing the next best action.
This means that HEXQ executes the hierarchically optimal policy (and in this case
the globally optimal policy) by simply taking the (abstract) action at each level that
maximises the exit value for that level only. Only seven decisions are required to
reach level one and take a primitive action.
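The saving quoted above can be reproduced with a line of arithmetic. The following snippet (illustrative only) compares the number of values examined by a full depth-first evaluation with a depth zero search, for 7 levels of 6 abstract actions each:

    levels, branching = 7, 6
    full_search = branching ** levels   # product of branching factors: 279,936
    depth_zero = branching * levels     # sum of branching factors: 42
    print(full_search, depth_zero, full_search // depth_zero)  # ratio is about 6,665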
In applying this approximation one needs to be careful. It is easy to construct
problems where the cost to goal is further from optimal when ignoring lower level
contributions. Nevertheless limiting the depth of search for the decomposed value
function is worth considering. This approximation can also be used with MAXQ
and provides one answer to the high cost of the value function depth-first-search
flagged by Dietterich (2000).
8.1.3 Variable Resolution Exit States
The number of region exits greatly affects the efficiency of HEXQ as it determines
the number of sub-MDPs that are generated and the number of abstract actions
that are explored at the next higher level. Savings can be realised by combining
exits that have similar exit states.
The HEXQ decomposition generates one region sub-MDP for each hierarchical
exit state. Recall from section 5.3 that a hierarchical exit state means that the exit
state is at full resolution in that it is not abstracted and therefore distinguished by
all the variable values along the path down to the base level of the hierarchy. This
was necessary to ensure safe state abstraction as the internal value to reach an exit
may vary for different base level states. The resolution of a state refers to the level
in the HEXQ hierarchy at which the state is represented abstractly. Limiting the
resolution of the exit states may reduce the number of sub-MDPs required. The
effect is to reduce the number of separate policies for each region at the expense of
possibly reaching each exit sub-optimally. The benefit is to reduce space complexity
and learning time complexity. The loss of resolving exit states will in general lead to
an overestimation of the value function as some negative rewards within the abstract
exit state may be ignored.
For the multi-storey building example, consider the floor region defined by the
25 abstract room states. Separate sub-MDPs for this region are required for each
of the four elevators, despite each of two elevators sharing the same room. If we
generate only one sub-MDP per elevator bank, that is, per abstract room state, the
storage requirements to represent the value function are reduced from 1306 to 906.
In this example there is no loss in accuracy of the value function, but in general the
loss is limited to the internal difference in state values, at the limit of resolution, in
this case the intra-room distance.
8.1.4 Variable Resolution Exits
In MAXQ, goal termination predicates are defined by the user. Each termination
predicate could combine many exits discovered by HEXQ. As HEXQ automatically
constructs the hierarchy, the question is: is it possible to find criteria for automati-
cally combining exits?
To allow safe state abstraction, HEXQ generates one abstract action for each
region exit. In turn, each abstract action needs to be represented and explored at
the next highest level in the hierarchy.
In flat MDPs, if two actions produce identical state transitions and rewards
from all states, then these actions may be combined into one and the problem
reduced without detracting from the optimal solution. This motivates the heuristic
for HEXQ to combine exits from a region when they always transition to the same
next region.
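A minimal sketch of this grouping heuristic follows. It is not HEXQ code; the exit records (exit identifier, observed next region, observed reward) are hypothetical, and in practice the comparison would be made on estimated transition and reward functions rather than raw observations.

    from collections import defaultdict

    def combinable_exit_groups(observations):
        # observations: iterable of (exit_id, next_region, reward) tuples.
        # Exits whose sets of observed (next_region, reward) outcomes are identical
        # are candidates for being combined into a single abstract action.
        outcomes = defaultdict(set)
        for exit_id, next_region, reward in observations:
            outcomes[exit_id].add((next_region, reward))
        groups = defaultdict(list)
        for exit_id, seen in outcomes.items():
            groups[frozenset(seen)].append(exit_id)
        return [ids for ids in groups.values() if len(ids) > 1]

    # Three doorway exits that always lead to room 7 with reward -1 form one group:
    print(combinable_exit_groups(
        [("d1", 7, -1), ("d2", 7, -1), ("d3", 7, -1), ("e1", 9, -1)]))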
Combining exits makes it easier for some sub-MDPs to exit as the choice of exit
has increased. This will in general increase the internal value function of a sub-MDP
as it is less costly to reach the closest of the combined exits. On the other hand
there is an increased loss of control as exits cannot be discriminated at the next
level. The net effect on the value function and resultant policy depends on each
specific problem instance and needs to be tested. It should be noted that sub-MDP
termination, with combined exits, may no longer be totally independent of the entry
state. Nevertheless, combining exits may still provide a good approximation to the
safe state abstraction produced by HEXQ.
For the multi-storey problem, the two floor exits going up for each elevator bank
may be combined as they always lead to the same next state (the room on the floor
above). Similarly the down exits may be combined. The effect of combining elevator
exits is to reduce the storage space requirements from 1306 to 866.
As another illustration, consider the maze introduced in Chapter 1, this time
with the wider doorways shown in figure 8.2. With deterministic actions HEXQ
would generate 3 exits per doorway and 12 exits per room altogether. This would
result in 12 sub-MDPs and 12 abstract actions at the room level of abstraction.
If the problem is modelled at the resolution of the rooms only, the three exits per
doorway would be indistinguishable. Each would transition between similar rooms
with the same reward. Using the above heuristic, these exits could be combined into
one.
Figure 8.2: The simple room example with wider doorways.
Since only one sub-MDP would now be generated per doorway, the agent, having
decided on the doorway to use, would leave the room by the nearest of the three
exits that were combined to form the one doorway exit. If the value function is
evaluated to full resolution then safe state abstraction could no longer be guaranteed
because the E action value function on exit would now depend on the agent’s starting
position in the room. An average E value for the three exits would result based on
the frequency that each type of exit was experienced.
Combining exits with similar abstract state transitions and reward functions
requires that HEXQ first identifies the exits and explores them at the next level
before they can be combined in the previous level. This means that HEXQ needs to
backtrack during hierarchy construction making the algorithm more complex. Back-
tracking may well become a feature of hierarchy construction as will be suggested
in section 9.6 to achieve more flexible and robust learning.
The idea can again be generalised to a variable resolution of abstract states.
Exits are combined if they have identical transition and reward functions for all
hierarchical exit states down to the same level of resolution.
There are other conditions for combining exits that suggest themselves. For
example, if the exit values $E$ for a set of exits are the same for any abstract state $s^e$, i.e. $\forall\, i, j, k : E(s^e_i, a^e_j) = E(s^e_i, a^e_k)$, then the exits can be combined and treated as
one abstract action requiring only one sub-MDP. This type of combination will still
represent the value function losslessly and may improve the hierarchically optimal
solution. The condition that E values are the same may be generalisable to a small
difference between E values. Approximations to these conditions would also be a
good heuristic for combination exits.
A potential application of exit combination is the Ball Kicking problem from
Chapter 6, extended with another variable that determines the goal direction. Ab-
stract actions that reliably kick into one of the goals could be combined. With the
ball location variable specified relative to the target goal, the agent would then not
have to learn to kick to each end of the field separately.
8.2 Stochastic Approximations
There are stochastic MDPs that look like they should decompose easily, but HEXQ
is unable to decompose them because of the strict requirement that the decomposed
value function is lossless. An example is the simple maze from Chapter 1, when the
actions are stochastic in the sense that say 70% of the time they move the agent in the
intended direction and 10% of the time in each of the three other directions. HEXQ
cannot decompose this problem in any useful way, as exits through a particular
doorway cannot be guaranteed and the reachability condition of the HEXQ partition
is violated.
It is possible for some problems to use a deterministic MDP policy to approximate
a stochastic MDP policy, if environments can be identified where the stochasticity
does not cause undue adverse effects. Some stochastic problems that produce poor
automatic decompositions may produce better decompositions with deterministic
actions. This means that approximate, but compact, solutions can be found with
HEXQ to these problems.
An example of a benign environment is a navigation problem where the error
distribution following each action is symmetrically distributed about the direction
of travel, where it is always possible to reverse the direction, and where the reward is constant per
transition. A useful future research topic would be to formally state conditions and
possibly error bounds for MDPs, to approximate a stochastic MDP policy with a
deterministic MDP policy. Liptser et al. (1996), for example, show that for certain
continuous problems, when the noise “intensity” is small and the fluctuations become
fast, the stochastic problem can be approximated by a deterministic one.
Without a model, the deterministic approximation to stochastic actions may not
be available. In this case it may be possible to use the transition with the highest
probability or payoff. In Euclidean metric spaces the transition closest to the sample
average of the next states may be a good candidate.
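A minimal sketch of the model-free variant suggested above is given below; it simply replaces each sampled transition distribution by its most frequently observed next state. The data layout is a hypothetical illustration.

    from collections import Counter

    def deterministic_approximation(samples):
        # samples: dict mapping (state, action) to a list of observed next states.
        # Returns a deterministic transition table using the most probable next state.
        return {sa: Counter(next_states).most_common(1)[0][0]
                for sa, next_states in samples.items()}

    # An action that usually moves the agent East but occasionally slips:
    samples = {((2, 3), "East"): [(3, 3), (3, 3), (2, 4), (3, 3), (2, 2)]}
    print(deterministic_approximation(samples))  # {((2, 3), 'East'): (3, 3)}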
Figure 8.3: Kaelbling's 10 × 10 navigation maze. The regions represent a Voronoi partition given the circled landmarks.

When benign conditions prevail, the HEXQ decomposition is particularly encouraging because the decomposition with deterministic actions has been proven to
be globally optimal (section 4.3.1). The deterministic MDP policy, as an approxi-
mation to the stochastic problem, should therefore produce good solutions. In the
worst case, a deterministic MDP policy may be arbitrarily bad or fail to reach an
exit altogether.
As an example, consider a maze task introduced by Kaelbling to show how aiming
for landmarks can reduce the computational complexity of an MDP by sacrificing
global optimality. In the maze (figure 8.3) an agent must learn to navigate from a
random start location to a random goal location by the shortest route on the 10×10
grid. After reaching the goal the problem terminates and a new starting position
and goal are chosen at random. The actions available to the agent are to move one
square to the North, South, East or West. The interesting feature about this maze
is that each location is itself a goal.
To solve this maze it is necessary to solve the equivalent of 100 MDPs, one for
each goal location. The size of the state space is therefore 100 locations ×100 goals
= 10,000 states.
In Kaelbling’s maze the agent is assisted by 12 fixed landmarks (indicated by
circles in figure 8.3) spread arbitrarily throughout the grid. The locations are parti-
tioned into Voronoi regions indicating their closest landmark. In a stochastic setting,
Kaelbling’s error distributions on each action are as follows: the agent lands in the
intended cell 20% of the time and 20% of the time in each of the 4 neighboring
cells. If the actions move the agent outside the boundary, the agent is moved to the
closest internal location.
This task is represented as an MDP to HEXQ, with the states described by a
vector of 3 variables, namely: agent-location (100 values), goal-location (100 values)
and closest-landmark (12 values). A dummy action is introduced to allow the agent
to signal arrival at the goal. Primitive rewards are −1 for each action.
HEXQ cannot decompose the stochastic version of this problem. The reason
is that the reachability condition of the HEXQ partition cannot be satisfied. In
any sub-region of the maze the agent may not reach the intended region exit with
probability one because the stochastic drift may cause it to leave at any boundary
location.
The whole point of Kaelbling’s original paper was to show how, using landmarks,
it is possible to efficiently learn an approximate navigation policy that moves to any goal at a small cost above optimal. The basic idea in HDG is to navigate at the abstract level of landmarks and only make local decisions at the finer, primitive level of resolution.
HEXQ can solve this problem using the approximation discussed above. Having
automatically decomposed and solved the deterministic 10 × 10 maze optimally,
the solution to the stochastic problem is to execute the deterministic MDP optimal
policy using hierarchical greedy execution. A hierarchical policy will lock in subtasks
until they exit. When actions are stochastic it is possible that the agent will drift
to locations where the sub-task policy is no longer appropriate. Hierarchical greedy
execution continuously interrupts each sub-task execution and recalculates the top
level value function to find the next hierarchically optimal primitive action.
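A minimal sketch of hierarchical greedy execution is shown below. It is illustrative only; the hierarchy object and its methods are hypothetical stand-ins for the HEXQ structures.

    def hierarchical_greedy_step(hierarchy, state):
        # Re-evaluate the decomposed value function from the top of the hierarchy and
        # descend greedily until a primitive action is reached (cf. equation 8.2).
        node = hierarchy.top_sub_mdp()
        action = node.best_action(state)
        while not hierarchy.is_primitive(action):
            node = hierarchy.sub_mdp(action)   # sub-MDP invoked by the abstract action
            action = node.best_action(state)
        return action

    def execute(env, hierarchy, state, max_steps=1000):
        # Unlike locking in a sub-task until it exits, the choice is revised every step,
        # so stochastic drift away from a sub-task's region is corrected immediately.
        for _ in range(max_steps):
            state, done = env.step(hierarchical_greedy_step(hierarchy, state))
            if done:
                break
        return state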
Figure 8.4: Performance of HEXQ on the stochastic version of Kaelbling's maze (average steps per trial versus trials, for the flat learner and HEXQ). Steps per trial are collected in buckets of 10 samples over 10 runs. The trend lines are 255 point moving averages.
Figure 8.4 shows the detail of the performance of a flat reinforcement learner and
HEXQ for the stochastic version of the 10× 10 maze with hierarchical execution of
the previously found optimal policy using deterministic actions. HEXQ completes
construction of the hierarchy and finds an optimal policy by about 30,000 trials.
The step up in the graph for HEXQ at 50,000 trials is where stochastic actions
are switched in. The flat RL uses a learning rate of 0.25. From value iteration we
know the optimal performance is 9.24. HEXQ using the stochastic approximation
achieves 10.1 average steps per trial. The flat learner achieves 10.1 after about
200,000 trials and would presumably converge to the optimum with further trials
and with a gradual reduction in the learning rate (see appendix B for further comments).
Stochastic transitions may thus be approximated by deterministic transitions in
some forgiving problems allowing HEXQ to decompose the problem.
8.3 Discussion and Conclusion
As discussed in section 4.3, a hierarchical representation usually constrains the set
of policies available for the original problem. In hierarchical reinforcement learners
such as MAXQ, HAMQ and HEXQ there are no guarantees in general that the
hierarchically optimal policy is close to the global optimal policy. In this sense a hi-
erarchy, whether user defined or generated automatically, generally already provides
only an approximate solution to the original problem.
This Chapter has introduced extensions to the basic HEXQ algorithm that allow
a decomposed value function to be represented approximately. The appropriateness
of each approximation is problem dependent, but the accuracy may be controlled
by varying the resolution of the model. This provides an anytime mechanism for
learning and execution trading off decision quality against time.
Many questions are left unanswered, such as clear statements for the conditions
under which the various approximations will work and precise error bounds. It may
be possible to learn the most appropriate depth of resolution at each level of the
hierarchy. These and other questions are left for future research.
Chapter 9
Future Research
The objective of this thesis has been to discover hierarchical structure in reinforce-
ment learning. The key idea embodied in the HEXQ algorithm is to find reusable
sub-tasks and abstract them to form a reduced model of the original problem that
requires significantly less space and time for its solution.
This Chapter will discuss various directions for future research that builds on this
foundation, removing assumptions, increasing the scaling potential of hierarchical
reinforcement learning and moving towards more complex, realistic environments.
9.1 Discovering Exits
One of the assumptions made by HEXQ is that all exits are discovered at each level
using a random search. For the problems in this thesis the exploration time was
set manually to ensure that this was the case. As explained in section 5.2, missing
important exits may result in a poorer solution or failure to find a solution.
The 25 rooms experiment in section 6.2 highlighted this weakness. The heuristic to discover exits by testing temporally close samples of transitions failed to find all
the exits for some stochastic specifications of the problem. Other statistical tests,
such as the Chi squared test, have been tried, but the performance of the heuristic
depends very much on the transition structure of the problem. If the structure is
such that a temporal sequence of samples under a random action policy moves the
agent to different locations in the state space, then the distribution of the temporal
sample may not be able to discover an exit in a particular part of the state space.
There are several solutions to this issue. Of course, if the original problem is
known to be Markov, then it is tempting to collect sample data for each lower level
transition probability distribution in the context of the remaining variables. These
distributions are certain to be stationary by assumption. It is an easy matter to
test for exits to any level of significance by collecting enough data and comparing
distributions. The problem with this approach is that the number of contexts can be
exponentially large, depending on the variables still to be processed in the ordered
state vector. The “curse of dimensionality” is transferred to the enumeration of the
contexts and this is impractical for large problems.
One solution is to test for unusual circumstances at a lower level of significance
and tag these situations as specific conditions under which to collect more reliable
statistics before declaring exits. In this way only a practical selection of contexts
are tested and the test can be conducted to any level of significance by simply
collecting enough data. Humans seem to use this method. For example, if an
observation is made that seems unusual, attention is focussed on that particular
circumstance until it is decided that it was either a coincidence or a new phenomenon.
Unusual circumstances may, for example, still be identified by testing a temporally
close sample, but this time the lower level of significance of the test will only flag
situations for further testing. The key is not to commit to these potential exits until
they are confirmed by testing in their specific contexts.
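A minimal sketch of this two-stage idea is given below. It is not the HEXQ heuristic itself; the data layout, the lenient threshold of 0.2 and the use of scipy's chi-squared contingency test are illustrative assumptions.

    from collections import Counter
    from scipy.stats import chi2_contingency

    def flag_candidate_exit(recent_next_states, overall_next_states, alpha=0.2):
        # Compare a temporally close sample of outcomes for a state-action pair with
        # the outcomes observed overall. A lenient alpha only flags the pair for
        # further, context-specific data collection before an exit is declared.
        states = sorted(set(recent_next_states) | set(overall_next_states))
        if len(states) < 2:
            return False
        recent, overall = Counter(recent_next_states), Counter(overall_next_states)
        table = [[recent[s] for s in states], [overall[s] for s in states]]
        _, p_value, _, _ = chi2_contingency(table)
        return p_value < alpha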
A further extension is to be able to add exits belatedly. Some exits may be hard
to find or may not be needed to solve the problem. If exits could be accommodated
at any time during hierarchy construction, this would assist in two ways:
• The HEXQ time complexity can be reduced by searching for exits concurrently
with region discovery at the next higher level.
• An impasse or aberration discovered at a higher level due to a missed exit at
a lower level can be repaired.
More recently Potts and Hengst (2004) have shown how learning can be speeded
up further by allowing exits to be concurrently discovered and assimilated at multiple
levels.
9.2 Stochasticity at Region Boundaries
HEXQ will only find meaningful decompositions if regions can be found such that
their exit states can be reached with certainty. Even when this is the case, the
constraint on policies inherent in a HEXQ hierarchy means that low cost region
leaving policies may be missed, as was illustrated in section 4.3. This raises the
issue of whether the stochasticity can be better handled at region boundaries.
One answer is to combine exits as suggested in section 8.1.4. While this may
improve the value function locally for a reusable subtask, it gives up the opportunity
to choose a specific exit in each particular instantiation of the subtask. In HEXQ it
also introduces an approximation for the hierarchical policy, as the exit transition
may no longer be independent of the way the region was entered. Having said that,
the overall solution may improve depending on the tradeoff in the ease of exiting
against the precision of exit control required. For example, exiting a multiple state
wide doorway may be easier when the exit states are combined, but if exiting other
than via the central state leads to injury then the combined exit is not such a good
idea. This tradeoff is problem dependent and may be able to be learnt.
Another interesting idea is to explore the general possibility that stochasticity
at lower levels in the hierarchy can be contained and controlled when constructing
models at higher levels. If a robot navigates through rooms, its stochastic actions
may cause it to slip. Sensor readings, indicating its location, will have errors. The
probability of occupying a particular room, however, is less error prone. Finney,
Hernandez, Oates, and Kaelbling (2001) suggested similar research objectives.
Other related work includes the findings by Lane and Kaelbling (2002) in which care-
fully constructed macros “hide” navigational stochasticity to produce a deterministic
planning problem.
In contrast to the partitioning employed by HEXQ, Moore et al. (1999) use
overlapping regions in the “Airport Hierarchy” to decompose the navigation space
(see section 3.5.5). This suggests another approach to automatically decomposing
MDPs to contain stochasticity. The abstract model of moving between overlapping
regions will exhibit less stochasticity than that for the base level states. If repeat-
able overlapping Markov sub-space regions can be found, then transitions between
regions may be approximately deterministic. In the multi-room problem, section
6.2, for example, the doorways between rooms are similar. A region defined around
a doorway would allow stochastic exits to be modelled and contained, avoiding the
issue of combining exits. The agent navigates with more certainty from the center
of the room to the doorway with control passed from region to region in the middle
of their overlap.
9.3 Multiple Simultaneous Regions
HEXQ orders the variables of the MDP by frequency of change. This ordering is an
effective but crude heuristic to construct a hierarchy. The heuristic hopes that it can
uncover variables that have different temporal frequencies of response. Each level
in the HEXQ hierarchy is associated with one variable producing a “monolithic”
hierarchical structure.
A more efficient decomposition may be to have HEXQ find separate partitions
for all the remaining variables at each level in the hierarchy. In this case, the same
exploration effort may uncover multiple abstractions concurrently.
Figure 9.1: A simple navigation problem to leave a room. The agent uses one step moves in 4 compass directions to reach the goal state from any starting position.

Consider the simple room navigation problem in figure 9.1. In this episodic
problem there are 25 states, each labelled with their x and y coordinates. The agent
can deterministically move one step North, South, East and West. It receives a
reward of -1 per step. The objective is to reach the goal, whereupon the problem
terminates and the agent is restarted at random somewhere on the grid.
Ordering the x and y variables in order of frequency, HEXQ will choose one
randomly above the other. If the first variable chosen is y, then HEXQ will find
one region at level 1 in the hierarchy containing the 5 states defined by the values
of the y variable. The actions North and South allow navigation within the region
and East-West actions are exit actions. Level 2 in the HEXQ hierarchy will have 5
abstract states and the problem will be solved by navigating to the abstract state
associated with x = 4 and taking abstract action (y = 4, east).
Searching for Markov sub-space regions for each of the two variables simultane-
ously will produce two different partitions of the state space, one for each variable,
as depicted in figure 9.2. At level 2 in the hierarchy, abstract states are defined
by the Cartesian product of the two region labels. Exits are tagged with the other
variables they change. Combining regions at the next level, the only exits retained
are ones with the same exit action that change the same variables. Other exits can be discarded, as they are absorbed within the other region. Combination produces only one abstract state and one compound abstract action (x = 4, y = 4, East), at level 2, leading to the goal. The others cancel out. The execution of this type of compound abstract action would move x to position 4, y to position 4 and perform primitive action East.

Figure 9.2: New hierarchical structure for the simple navigation problem using multiple partitions at level 1.
This simple problem shows the extra compaction potential from multiple par-
titions. The flat problem requires 25 states × 4 actions = 100 Q values. HEXQ
requires 5 states × 4 actions × 5 exit states + 5 states × 11 exits = 155 E values.
Using simultaneous regions would require 5 states × 4 actions × 2 regions + 1 state
× 1 exit = 41 E values.
9.4 Multi-dimensional Actions
Complex actions may be represented in factored form. Multi-dimensional actions
are interpreted as concurrent along each dimension. For example, speaking, walking
and head movements may be represented by three action variables. The action space
is defined by the Cartesian product of each of the variables.
Decomposing a problem by factoring over states and actions simultaneously re-
sults in parallel hierarchical decompositions. Parallel decomposition refers to an
MDP broken down into a set of sub-MDPs that are “run in parallel” (Boutilier
et al., 1999). Factoring over actions alone has been considered by Pineau et al.
(2001).
The example from the above section can be used to illustrate the point. If
the actions are 2-dimensional with one action variable having values {North, South,
Stay} and another with values {East, West and Stay}, it would be possible to take
simultaneous actions, such as (North, West) or (Stay, East).
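As a small illustration (not from the thesis), the joint action space is just the Cartesian product of the two action variables:

    from itertools import product

    ns_actions = ["North", "South", "Stay"]
    ew_actions = ["East", "West", "Stay"]
    joint_actions = list(product(ns_actions, ew_actions))
    print(len(joint_actions))  # 9 joint actions, e.g. ('North', 'West') or ('Stay', 'East')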
Unfortunately, the problem does not decompose, because moving in two direc-
tions simultaneously changes both the x and y variables. However, applying the
action variables separately for each state variable discovers the same decomposition
as in figure 9.2. In this case the North-South-Stay actions are associated with the
y state variable and the East-West-Stay actions with the x variable. The hierar-
chy is now interpreted to be able to execute both regions simultaneously, effectively
moving the agent diagonally towards the goal. It is unclear how a value function
would be decomposed for these parallel hierarchical decompositions if rewards are
not independent and cannot just be added.
9.5 Default Hierarchies
The declaration of an exit by HEXQ indicates that a transition has been found at a
particular level that cannot be explained (represented by a stationary probability).
When HEXQ discovers a new exit it generates a new abstract action. The abstract
action is explored for all values of the next state variable. It can often happen that
the exit represents an unusual situation that is unlikely to repeat.
Figure 5.3 in Chapter 5 gives an example. The corner obstacle may only appear in
one room, yet, once exits are discovered as a result of the obstacle, it is hypothesised
that they might occur and are tested in all the other rooms.
If these rare situations can be detected, it would be possible to isolate them
to particular contexts and learn sub-MDPs separately for those contexts only. For
the above example, a new set of sub-MDPs would be learnt to exit the room con-
taining the obstacle, but no abstract actions associated with the obstacle would be
generated. The hierarchy would then be required to learn the default condition for
switching in the obstacle-in-room policy cache.
The idea of default hierarchies comes from Holland (1995) where they play a
similar role in extending models by noting rule exceptions to more general cases.
9.6 Dynamic Hierarchies
There are many reasons why initial hierarchical models may require adjustment or in
fact major revision as the agent learns. An agent may experience atypical situations
early in its life, be misled by a noisy environment, train on poor concepts, or be
shown a better policy.
An agent should be able to discard exits that have low utility or incorporate
newly discovered ones. For these cases it is necessary to backtrack levels in hierarchy
construction. To implement this flexibility in HEXQ, it is necessary to continue to
monitor transitions at lower levels. When new exits are discovered, additional sub-
MDPs and abstract actions are introduced. These require exploration at higher
levels and may result in the revision of partitions. When rare exits are discovered it
may not be worth revising a whole body of previously found knowledge but rather to
use default hierarchies, discussed previously, to isolate the exceptional circumstances
in which the exit occurred. If enough evidence accumulates, a more drastic revision
may be necessary. Practice may uncover shortcuts or not require some previously
generated abstract actions. In this case policies should be allowed to atrophy and
release their storage requirements for reuse in other areas.
An issue worth exploring further, is the efficiency of concurrent layered learning,
in which lower levels are allowed to keep learning while simultaneously training
higher levels (Whiteson and Stone, 2003). This is a feature of MAXQ. HEXQ
continues to refine the lower level policies while constructing the higher levels, but
not before all the exits are deemed to have been found.
Concurrent HEXQ (Potts and Hengst, 2004) is a step towards making the hier-
archy dynamic. It not only learns concurrently at multiple levels but also constructs
the hierarchy concurrently at multiple levels. Results for one example show concur-
rent HEXQ learning more than an order of magnitude faster than HEXQ which has
itself reduced the learning time over a flat learner multiple times.
Dynamic hierarchies are also envisaged to allow flexible adaption to changing
circumstances. When an agent is placed in a new environment it would be desirable
for it to adapt and not relearn every sub-task. In this case it will need the ability
to retain some concepts while modifying others.
It is likely that in time, for an agent working in the one environment, base level
concepts will not change and become crystalised as higher level concepts are built
on their foundation. The process of crystallisation is consistent with Utgoff’s notion
of the “frontier of receptivity” (2002).
9.7 Selective Perception and Hidden State
Future research discussed so far is designed to improve the HEXQ decomposition
in various ways, predicated on the assumption that the problem is specified as an
MDP. The next sections will propose future research that does not necessarily make
the assumption the state vector and actions are specified so as to give rise to Markov
transition and reward functions in a straightforward way.
HEXQ relies on an MDP specification as a starting point. One future research
direction is to attempt the automatic decomposition of a sensor state that may con-
tain redundant variables or one that does not exhibit the Markov property. Figure
9.3 is used to illustrate the challenges and possible approaches.
Figure 9.3: A partially observable environment with two independent tasks and 2-dimensional actions. Sensor = (x, y, z); Effector = ({E, W, Stay}, {N, S, Stay}).
This example shows two rooms of the type in figure 9.1 joined via a doorway.
The position in each room is described by the (x, y) coordinate, except that the x
variable values repeat in each room making the environment partially observable.
The agent cannot distinguish its unique position from the x and y variable labels
alone. The actions are factored by North-South and East-West variables making it
possible, for example, to move North and West simultaneously. It is also possible
for the agent to stand still, generating a total of 9 actions: North-Stay, North-East,
Stay-East, South-East, South-Stay, South-West, Stay-West, North-West, Stay-Stay.
Another variable, z, changes value cyclically, irrespective of the action taken, except
that, when at location z = 3, taking action Stay-East terminates the task. The
task is also terminated by leaving the right room at the top right hand corner as
illustrated. All steps incur a cost of -1 and the rewards on termination are fixed at
a positive value. The agent is replaced at random after termination.
It is anticipated that an automatic decomposition could find the hierarchical
structure illustrated in figure 9.4. Each variable's discovered Markov region is shown at the bottom of the figure. As the z region alone can reach termination reliably there is no need to combine it with the x and y regions.

Figure 9.4: The expected result of an automatic decomposition of the problem in figure 9.3.
Regions x and y cannot reach the other goal reliably. Combining them as in
section 9.3 helps, but this is not the whole solution because of state aliasing. By
generating history variables from given variables it is possible to uncover hidden
state. The idea is to simultaneously use selective perception and uncover hidden
state to achieve the task at hand as in McCallum’s (1995) UTree approach. An
advantage in a hierarchical structure is that history at more abstract levels can reach
back many primitive time steps as shown by Hernandez and Mahadevan (2000).
Generating memory at the abstract level allows the rooms to be disambiguated
based only on the two abstract actions to leave the abstract room region. It avoids
having to solve the much harder hidden state problem at the primitive level which
would require, at minimum, a history going back 5 time steps.
The characteristics of the idea suggested here are that the hierarchy is still dis-
covered automatically, state abstraction is used as in HEXQ, redundant variables
are eliminated and variables at higher levels in the hierarchy may even be invented.
An example of the latter characteristic is the formation of the room variable in
the above problem. Rooms are not represented at all in the original problem but
arose out of an aliased abstract state formed by combining two regions.
9.8 Deictic Representations
In deictic representations variables define a situation relative to an agent. This
allows many variables or actions that would otherwise be defined with absolute
reference to be represented relatively and compactly. In natural language, examples
of deictic expressions are “my car”, “the ball in front of me” and “the top disk
on peg 2”. One interpretation of HEXQ regions as a class is that they are deictic
expressions in that the higher level variables index their instantiations. For example,
the robot-location-relative-to-the-ball variable is a deictic representation in the ball
kicking example of section 6.4. The ball is represented by this variable in a state
relative to the robot, irrespective of its position on the field. For the Towers of
Hanoi example, the actions, movexy, are an example of deictic actions, as the disk
moved is implicitly the one on top, but could be any particular disk. This is one of
the reasons HEXQ is able to generalise.
An interesting research direction would be to learn deictic representations if the
state or action variables provided are object specific and weakly coupled. Two
subsystems are weakly coupled when their influence on each other is at longer time
scales than interactions inside each subsystem, or when they are coupled by a small
subset of variables. The total system may be described by a vector of variables, but
it is possible to take various subsets of variables and construct self contained models.
The multiple partitioning approach suggested above can therefore be seen as a way
to exploit decoupled or weakly coupled systems. In these systems there may be a
large number of variables (where many may be dependent on just a few elementary
ones). This could help to make the task of learning in complex problems tractable.
A robotic example is the ball collection challenge in the Sony legged league at
the 2002 RoboCup competition in Fukuoka, Japan (Olave et al., 2003). Two robots
were required to shoot 10 randomly placed balls into either goal in under 3 minutes.
A total environment description would include the position of each of the individual
balls. Of course the task of shooting a particular ball into a goal is only weakly
coupled to the task of shooting any other ball into the goal. The weak coupling is
manifest in that only occasionally another ball or robot presents an obstacle and
that there may be a best order in which to shoot the balls. By programming each
robot to only see its closest ball and to shoot this ball into the goal, the robots were
able to operate successfully by working on a much smaller sub-model of the total
system.
In the example in figure 9.4, a top level focus switching task could be generated to
switch between the two sub-problems. The deictic effect here is to switch attention
by “pointing” to either sub-problem and ignoring the other variables.
The research issues these types of situations present are also raised by Finney
et al. (2001). They suggest once again looking at blocks world scenarios. The
blocks world problem cannot be decomposed by HEXQ and Knoblock (1994) could
not automatically decompose the blocks world problem with ALPINE.
It would be interesting to see if problems such as the ball collection and the blocks
world could be automatically decomposed with the added machinery of hierarchical
state and action abstraction using the multi-partitioning suggested above. Despite
the negative results reported by Finney et al. (2002), introducing deictic variables in
a hierarchical structure appears not only to present compaction opportunities but
seems to account for the way humans abstract (Ballard et al., 1996).
When the interaction amongst problem variables increases, even humans have
difficulty solving these problems. This may be the reason why humans find games
like chess and Rubik’s Cube challenging. It might be instructive to direct research
at the more mundane problems of learning everyday tasks such as cooking or simple
parts assembly. This should allow the study of the underlying concept forming
mechanisms before these algorithms are stress tested on hard problems.
9.9 Quantitative Variables
Variable values in this thesis can be interpreted to reflect the arbitrary states of
an agent’s sensor. HEXQ therefore only assumes that the variables are qualitative
(see section 4.1). This does not preclude the possibility that variables represent
quantitative values, but the algorithm is not predicated on this possible inductive
bias.
The general philosophic approach is that if variables do have a natural order or
describe a metric space, then this needs to be learnt by the agent.
It would be worthwhile to extend HEXQ by assuming quantitative variables.
Various dimensionality reduction techniques, such as linear combination, may then
possibly be applied to a larger set of variables as a preprocessing step to HEXQ.
HEXQ might also be combined with methods such as principal component analysis,
independent component analysis or kernelised versions of these methods.
If the quantitative variables are continuous, then it may be possible to define
mappings into a discrete multi-variate state space that lends itself to HEXQ de-
composition. For example, it is possible to decompose the pole and cart problem
(Anderson, 1986) into two hierarchical levels. At the first level the angle and angular
velocity variables can define balancing subtasks with various accelerations to the left
and right. At the second level, the translational distance and velocity can be learnt
by switching between subtasks.
The learning of the quantitative nature of variables (both discrete and contin-
uous) and their exploitation in terms of hierarchical decomposition is beyond the
scope of this thesis and is left for future research.
9.10 Average Reward HEXQ
The use of average reward or discounted sum of future rewards optimality criteria
only becomes an issue when problems are infinite horizon. Chapter 7 shows how
a dual set of value functions can decompose the discounted sum of future rewards
over a task hierarchy.
There may be advantages in using average reward reinforcement learning with
task hierarchies (Ghavamzadeh and Mahadevan, 2002). The authors make the incor-
rect assumption that, given a stationary policy, the average reward for each subtask
is constant. However, the average reward for a subtask can vary depending on how
the task was initiated, and in general reusable subtasks will be initiated differently
depending on their context.
It is believed to be possible to define a dual value function decomposition for
average reward hierarchical reinforcement learning in the HEXQ setting along sim-
ilar lines to the decomposition of the sum of discounted future rewards. The two
functions for average reward decomposition are the average number of time-steps to
exit and the average reward to exit from any state in the subtask.
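As a rough, unverified sketch of how the two functions might combine, let $T^\pi_m(s)$ be the expected number of time steps to exit sub-MDP $m$ from state $s$ under policy $\pi$, and $R^\pi_m(s)$ the expected reward accumulated until exit. The gain contributed by repeatedly invoking the subtask from entry state $s$ would then be approximately

\[ \rho^\pi_m(s) \approx \frac{R^\pi_m(s)}{T^\pi_m(s)}, \]

which makes explicit that the average reward of a reusable subtask depends on its entry state, and hence on its context, rather than being constant.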
The details and confirmation of the formulation will be left to future work.
9.11 Deliberation
Planning and reinforcement learning are closely related. Sutton and Barto (1998)
explain how both planning and learning can be integrated by allowing both methods
to update the same value function. A planning process is formulated as search over
a state space generated from a model. A model (a state transition function) can be
learnt from the exploration efforts of a reinforcement learner. The model can then
be used for planning.
We have already seen hierarchies crafted by combining both reinforcement learn-
ing and planning (Ryan, 2002). Is it possible to discover abstractions such as rela-
tional descriptions of planning operators automatically? Some work in this area has
already started (Theocharous and Kaelbling, 2004). The automatic discovery of hi-
erarchy with HEXQ may be extended to re-representing attribute value descriptions
in relational terms and allowing hierarchical reinforcement learning to be seamlessly
combined with planning algorithms.
9.12 Training
Even though hierarchical representations can compress complex system behaviour,
finding the appropriate building blocks or intermediate subgoals may be very ex-
pensive if an agent is left to its own devices. While much of the required bias is
expensive for the agent designer to provide beforehand, or indeed unknown, the
agent can be designed to take advantage of an environment where its experiences
are structured by a trainer to assist the process of concept formation. This would
allow an accelerated progression of the frontier of receptivity (Utgoff and Stracuzzi,
2002).
If the agent is autonomous and any interaction can only come via its sensors
and effectors, how is it trained? Even if it is possible, in principle, to change the
“program” of such a learning agent, it may be too complex or labor intensive for a
programmer to understand the self developed internal representations. One solution
is to develop training programs and curricula in which the training is delivered via
the sense-act loop of the agent. A future research direction is to find ways to design
software that can benefit from a trainer via the base level sense-act loop.
A basic approach may be for the trainer to place the agent in that part of the
environment where the agent is likely to have the most fruitful experience. This is a
variation on the idea of the trainer controlling the reward structure. When subgoals
are crystallised, the intermediate reward can be removed and training can take place
at more abstract levels. This basic form of training would not require any special
representation of the training situation by the agent.
A more advanced solution is where the agent is able to copy the trainer’s actions.
This requires the agent to see itself in the trainer’s shoes, so to speak. It seems
monkeys can do this (hence one of the meanings of the word ape), but dogs cannot.
At the highest levels of abstraction, languages, including scientific languages of
mathematics and logic, provide the medium for rapid training in humans. Machines
would most likely need to learn representations that correspond to human under-
standable concepts so that human computer communication can take place. This is
likely for two reasons: (1) when both machines and humans learn in a similar envi-
ronment both are confronted with similar constraints shaping their concepts and (2)
if concept communication takes place throughout a learning lifetime, the concept
hierarchy that is crystallised is likely to be similar between the machine and the
human. As the next level of learning builds on the previous, having a common base
of concepts should make training easier.
9.13 Scaling to Greater Heights
The set of features describing the environment is assumed to be provided to HEXQ.
Where do these features come from?
For an autonomous system, at the lowest level of interaction the features are the
raw sensor inputs and the actions the raw effector commands. The vision sensor
and effectors for a Sony ERS-210 AIBO robot, shown in figure 9.5, make the point.
Its head and legs have 15 degrees of freedom and can be independently activated
to start to move to any target position specified in micro-radians at 8 microsecond
intervals (in time) using PID control. The vision sensor has 25,344 pixels each with
one of 16,777,216 possible colour values streaming in at 25 times a second with the
potential to generate a very large state space. This state space is both highly aliased
and redundant. An open question is how to learn to compress it to manageable size
in the pursuit of higher level objectives such as winning at soccer.
Figure 9.5: The Sony AIBO robot showing an image from its colour camera.
There is a pressing need to tackle these larger scaling issues. Future research is
suggested to use the hierarchical principles outlined in this thesis together with the
right inductive bias for an agent to learn multi-levels of description. The objective
is to bridge the semantic gap between the raw sensor and effector interaction and
higher level concepts to achieve complex tasks such as playing soccer.
An outline of how hierarchic concepts might develop is as follows: an agent
is assumed to have learnt (through evolution) that the visual state space can be
abstracted to edges at various orientations. There is evidence that these features
are present in the occipital lobe (Hubel and Wiesel, 1979). It may also detect blobs
of colour or texture. At the next level of abstraction these features may be able
to form regions, corresponding to objects such as balls, goals, blocks, etc. Further
abstraction may find regions that describe the dynamic interactions between the
agent and the object, as for example in the stylised soccer player, in chapter 6, that
learns to kick the ball. While this is speculative, it would be interesting to see if
these types of hierarchies can be learnt in sophisticated machines.
9.14 Conclusion
This chapter has discussed a number of potential research directions to build on the
ideas introduced with HEXQ. The major topics and directions are:
• Extending automatic MDP decompositions with multiple partitions and multi-
dimensional actions.
• Relaxation of the initial MDP assumption, anticipating hidden state and using
selective perception.
• Developing more robust and flexible dynamic hierarchies.
• Creating new representations from base level variables that are deictic or re-
lational.
• Scaling to realistic autonomous robotic applications.
Chapter 10
Conclusion
This thesis has started to tackle the open problem of automatically decomposing
multi-dimensional reinforcement learning problems. It has presented decomposition
conditions and an algorithm that have met the objective of reducing the computa-
tional complexity automatically. The empirical results support both the theory and
the successful operation of the HEXQ algorithm.
The invariant decomposition approach employed by HEXQ is only applicable to
finite MDPs and its effectiveness depends heavily on the particular representation
employed. It relies on variables having different time scales, variables being con-
sistently labelled and a constrained type of stochasticity. It is not a requirement
that the user knows whether the problem is structured in this way. The HEXQ
decomposition will succeed to the extent that these constraints are present.
One of the great advantages of the decomposition is the ability to generalise to
unseen situations. The HEXQ decomposition is predicated on a multi-dimensional
description of the state space.
The next sections will summarise the main ideas and some potential research
directions.
10.1 Summary of Main Ideas
Markov Subspaces. The key idea in this thesis is to uncover invariance in the form
of subregions of the state space that can then be abstracted. In both cases a
model of a part of the world is tested for repeatability in various contexts and
cached away for reuse. In HEXQ the (Markov equivalent) regions are modelled
as invariant sub-MDPs in the context of all the other variables. These regions
have the property that they can be state abstracted and their policies
treated as abstract actions in a reduced (abstract) semi-MDP.
Optimality of HEXQ. To make hierarchical learning tractable, the cache of poli-
cies is usually constrained over each region. In HEXQ, only policies leading to
each hierarchical exit state are cached. Unfortunately, this constraint on the
policies means that it is now impossible to make any optimality guarantees.
There are currently no known methods for constraining subregion policies and
guaranteeing a globally optimal solution in general. However, for determinis-
tic shortest path problems, or problems in which only the top level sub-MDP
is stochastic, the HEXQ constraints are proven to provide a globally opti-
mum solution. For stochastic shortest path problems HEXQ is hierarchically
optimal.
State Abstraction, Discounting and Infinite Horizon MDPs. The introduc-
tion of an additional decomposed discount function allows safe abstraction
when the discount factor is less than one. This dual function decomposition
makes it possible to hierarchically decompose and compact infinite horizon
MDPs where a solution may require the agent to continue in a sub-task for-
ever.
Abstract Q Functions. The HEXQ decomposition uses a value function that is
decomposed over various levels of resolution in the hierarchy. An important
aspect of the formulation of the decomposition is to treat the primitive reward
on exit of a sub-MDP as belonging to the abstract state transition and not
as a part of the internal cost to reach the exit. The HEXQ decomposition
equations reduce nicely to the normal Q learning equations if there is only one
variable in the state description or when the problem is approximated at a
particular level of resolution. The HEXQ action value function E is seen to be
the abstract generalisation of the normal Q function.
HEXQ was demonstrated on a range of finite state problems. The empirical
results have confirmed the theory and algorithmic integrity of HEXQ. They have
also highlighted some of the finer points and exposed limitations that require further
research.
10.2 Directions for Future Research
A number of research themes have emerged from this early work:
Approximation. The HEXQ decomposition represents a hierarchy of abstract
models. This structure suggests using variable resolution to make approxima-
tions when constructing and solving the HEXQ hierarchy. While the accuracy
of the variable resolution depends on the problem characteristics, it is possible
to save on storage, learning time and execution time by limiting the number
of levels that are taken into account.
General Hierarchies. One of the compelling future directions of this line of re-
search is to construct more general hierarchies by searching for abstractions
over sub-sets of variables simultaneously. This retains the key idea of finding
invariant sub-spaces, but avoids the need to have one order for the variables. It
is envisaged that both selective perception and uncovering hidden state along
the lines of UTree (McCallum, 1995) can extend the automatic discovery of
hierarchical structure.
Training instead of Programming. Relying on trial and error self exploration
alone is likely to be insufficient to allow an agent to learn effectively in complex
environments. The solution will most likely involve a teacher in the agent’s
environment who can structure the agent’s experience into a curriculum to
best assist the formation of a useful abstraction hierarchy. For this exercise to
be more than just training by varying the reinforcement signal, the agent will
need to have the sophistication to picture itself in the shoes of the trainer to
clone behaviour and to share communicable abstract concepts.
10.3 Significance
It seems crucial that effective intelligent agents should be able to view complex
environments at different levels of abstraction, decompose, focus and re-represent
problems.
This thesis has made a contribution towards this longer term objective. Within
the limited scope of a multi-dimensional reinforcement learning problem, the task
of decomposition, state and action abstraction, sub-goal identification and hierar-
chy construction is automated. Automation means that some large reinforcement
learning problems, that are otherwise intractable, can be solved efficiently, without
a designer having to specify the decomposition manually.
The belief is that the discovery and manipulation of hierarchical representations
will prove essential for lifelong learning in autonomous goal directed agents.
Bibliography
Saul Amarel. On representations of problems of reasoning about actions. In Donald
Michie, editor, Machine Intelligence, volume 3, pages 131–171, Edinburgh, 1968.
Edinburgh at the University Press.
Charles W. Anderson. Strategy learning with multilayer connectionist representa-
tions. In Proceedings of the Fourth International Workshop on Machine Learning,
pages 103–114, San Mateo, CA, 1986. Morgan Kaufmann.
David Andre and Stuart J. Russell. State abstraction for programmable reinforce-
ment learning agents. In Rina Dechter, Michael Kearns, and Rich Sutton, editors,
Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages
119–125. AAAI Press, 2002.
Ross Ashby. Design for a Brain: The Origin of Adaptive Behaviour. Chapman &
Hall, London, 1952.
Ross Ashby. Introduction to Cybernetics. Chapman & Hall, London, 1956.
Dana H. Ballard, Mary M. Hayhoe, Polly K. Pook, and Rajesh P. N. Rao. Deictic
codes for the embodiment of cognition. Technical Report NRL95.1, National
Resource Lab. for the Study of Brain and Behavior, U. Rochester, 1996.
Richard Bellman. Adaptive Control Processes: A Guided Tour. Princeton University
Press, Princeton, NJ, 1961.
Dimitri P. Bertsekas and John N. Tsitsiklis. An analysis of stochastic shortest path
problems. Mathematics of Operations Research, 16:580–595, 1991.
Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Struc-
tural assumptions and computational leverage. Journal of Artificial Intelligence
Research, 11:1–94, 1999.
Craig Boutilier and Richard Dearden. Using abstractions for decision-theoretic plan-
ning with time constraints. In Proceedings of the Twelfth National Conference on
Artificial Intelligence (AAAI-94), volume 2, pages 1016–1022, Seattle, Washing-
ton, USA, 1994. AAAI Press/MIT Press.
Rodney A. Brooks. Elephants don’t play chess. Robotics and Autonomous Systems
6, pages 3–15, 1990.
Andy Clark and Chris Thornton. Trading spaces: Computation, representation, and
the limits of uninformed learning. Behavioral and Brain Sciences, 20(1):57–66,
1997.
Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to
Algorithms. MIT Press, Cambridge Massachusetts, 1999.
Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. Advances in
Neural Information Processing Systems 5 (NIPS), 1992.
Edwin D. de Jong and Tim Oates. A coevolutionary approach to representation
development. Proceedings of the ICML-2002 Workshop on Development of Rep-
resentations, pages 1–6, 2002.
Thomas Dean and Robert Givan. Model minimization in Markov decision processes.
In AAAI/IAAI, pages 106–111, 1997.
Thomas Dean and S. H. Lin. Decomposition techniques for planning in stochastic
domains. Technical Report CS-95-10, Department of Computer Science Brown
University, 1995.
Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value
function decomposition. Journal of Artificial Intelligence Research, 13:227–303,
2000.
Thomas G. Dietterich and Ryszard S. Michalski. A comparative review of selected
methods for learning from examples. Machine Learning, pages 41–81, 1984.
Bruce L. Digney. Emergent hierarchical control structures: Learning reactive hier-
archical relationships in reinforcement environments. From Animals to Animats
4: Proceedings of the fourth international conference on simulation of adaptive
behaviour, pages 363–372, 1996.
Bruce L. Digney. Learning hierarchical control structures for multiple tasks and
changing environments. From animals to animats 5: Proceedings of the fifth in-
ternational conference on simulation of adaptive behaviour. SAB 98, 1998.
Chris Drummond. Accelerating reinforcement learning by composing solutions of
automatically identified subtasks. Journal of Artificial Intelligence Research, 16:
59–104, 2002.
Sarah Finney, Natalia H. Gardiol, Leslie Pack Kaelbling, and Tim Oates. The
thing that we tried didn’t work very well: Deictic representation in reinforcement
learning. 18th International Conference on Uncertainty in Artificial Intelligence,
(UAI-02), 2002.
Sarah Finney, Natalia Gardiol Hernandez, Tim Oates, and Leslie Pack Kaelbling.
Learning in worlds with objects. Working Notes of the AAAI Stanford Spring
Symposium on Learning Grounded Representations, 2001.
Mohammad Ghavamzadeh and Sridhar Mahadevan. Hierarchically optimal average
reward reinforcement learning. In Claude Sammut and Achim Hoffmann, edi-
tors, Proceedings of the Nineteenth International Conference on Machine Learn-
ing, pages 195–202. Morgan-Kaufman, 2002.
Michael Bonnell Harries, Claude Sammut, and Kim Horn. Extracting hidden con-
text. Machine Learning, 32(2):101–126, 1998.
J. Hartmanis and R.E. Stearns. Algebraic Structure Theory of Sequential Machines.
Prentice-Hall, 1966.
Milos Hauskrecht, Nicolas Meuleau, Leslie Pack Kaelbling, Thomas Dean, and Craig
Boutilier. Hierarchical solution of Markov decision processes using macro-actions.
In Fourteenth Annual Conference on Uncertainty in Artificial Intelligence, pages
220–229, 1998.
Bernhard Hengst. Generating hierarchical structure in reinforcement learning from
state variables. In PRICAI 2000 Topics in Artificial Intelligence, pages 533–543,
San Francisco, 2000. Springer.
Bernhard Hengst. Discovering hierarchy in reinforcement learning with HEXQ.
In Claude Sammut and Achim Hoffmann, editors, Proceedings of the Nineteenth
International Conference on Machine Learning, pages 243–250. Morgan-Kaufman,
2002.
Bernhard Hengst, Darren Ibbotson, Son Bao Pham, John Dalgliesh, Mike Lawther,
Phil Preston, and Claude Sammut. The UNSW RoboCup 2000 Sony legged league
team. In Peter Stone, Tucker Balch, and Gerhard Kraetzschmar, editors, RoboCup
2000: Robot Soccer World Cup IV, volume 2019 of Lecture Notes in Artificial
Intelligence subseries of Lecture Notes in Computer Science, chapter Champion
Teams, pages 64–75. Springer-Verlag, Heidelberg, 2001.
Natalia Hernandez and Sridhar Mahadevan. Hierarchical memory-based reinforce-
ment learning. Fifteenth International Conference on Neural Information Pro-
cessing Systems, Nov. 27-December 2nd 2000. Denver.
John H. Holland. Hidden Order: How Adaptation Builds Complexity. Helix books.
Addison-Wesley, Reading Massachusetts, 1995.
David H. Hubel and Torsten N. Wiesel. Brain mechanisms of vision. A Scientific
American Book: the Brain, pages 84–96, 1979.
Leslie Pack Kaelbling. Hierarchical learning in stochastic domains: Preliminary
results. In Machine Learning Proceedings of the Tenth International Conference,
pages 167–173, San Mateo, CA, 1993. Morgan Kaufmann.
Leslie Pack Kaelbling, Michael L. Littman, and Andrew P. Moore. Reinforcement
learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
Craig A. Knoblock. Automatically generating abstractions for planning. Artificial
Intelligence, 68(2):243–302, 1994.
Terran Lane and Leslie Pack Kaelbling. Nearly deterministic abstractions of
Markov decision processes. In Eighteenth National Conference on Artificial In-
telligence (AAAI-02), pages 260–266, 2002.
R. Sh. Liptser, W. J. Runggaldier, and M. Taksar. Deterministic approximation for
stochastic control problems. SIAM Journal on Control and Optimization, 34(1):
161–178, 1996. Society for Industrial and Applied Mathematics.
Andrew McCallum. Reinforcement learning with selective perception and hidden
state. PhD thesis, Department of Computer Science, University of Rochester,
1995.
Amy McGovern and Richard S. Sutton. Macro-actions in reinforcement learning: An
empirical analysis. Amherst technical report 98-70, University of Massachusetts,
1998.
Amy E. McGovern. Autonomous Discovery of Temporal Abstraction from Interac-
tion with an Environment. PhD thesis, University of Massachusetts, Amherst,
Massachusetts, 2002.
Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-Cut: Dynamic discovery of
sub-goals in reinforcement learning. volume 2430 of Lecture Notes in Computer
Science. Springer, 2002.
Donald Michie. On Machine Intelligence. Ellis Horwood Limited, Chichester, second
edition, 1986.
Donald Michie and R. A. Chambers. BOXES: An experiment in adaptive control.
In E. Dale and D. Michie, editors, Machine Intelligence 2, pages 137–152. Oliver
and Boyd, Edinburgh, 1968.
Marvin Minsky. The Society of Mind. Simon and Schuster, New York, 1985.
Andrew Moore, Leemon Baird, and Leslie Pack Kaelbling. Multi-value-functions:
Efficient automatic action hierarchies for multiple goal mdps. In Proceedings of
the International Joint Conference on Artificial Intelligence, Stockholm, pages
1316–1323, 340 Pine Street, 6th Fl., San Francisco, CA 94104, 1999. Morgan
Kaufmann.
Andrew W. Moore. The parti-game algorithm for variable resolution reinforcement
learning in multidimensional state-spaces. In Jack D. Cowan, Gerald Tesauro, and
Joshua Alspector, editors, Advances in Neural Information Processing Systems,
volume 6, pages 711–718. Morgan Kaufmann Publishers, Inc., 1994.
Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforce-
ment learning with less data and less time. Machine Learning, 13:103–130, 1993.
Craig G. Nevill-Manning. Inferring Sequential Structure. PhD thesis, University
of Waikato, 1996.
Nils J. Nilsson. Teleo-reactive programs for agent control. Journal of Artificial
Intelligence Research, 1:139–158, 1994.
Andre Olave, David Wang, James Wong, Timothy Tam, Benjamin Leung, Min Sub
Kim, James Brooks, Albert Chang, Nik Von Huben, Claude Sammut, and
Bernhard Hengst. The UNSW RoboCup 2002 Legged League team. Workshop
on Adaptability in Multi-Agent Systems: The First RoboCup Australian Open
(AORC-2003), 2003.
Ronald E. Parr. Hierarchical Control and learning for Markov decision processes.
PhD thesis, University of California at Berkeley, 1998.
Ronald E. Parr. Optimality and HAMs. personal communication, March 2002.
Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, San Francisco, revised second printing edition,
1988.
Joelle Pineau, Nicholas Roy, and Sebastian Thrun. A hierarchical approach to
POMDP planning and execution. In Workshop on Hierarchy and Memory in
Reinforcement Learning, ICML-2001, 2001.
Duncan Potts and Bernhard Hengst. Concurrent discovery of task hierarchies. In
Proceedings of the AAAI Spring Symposium on Knowledge Representation and
Ontology for Autonomous Systems, pages 17–24, 2004.
Doina Precup. Temporal Abstraction in Reinforcement Learning. PhD thesis, Uni-
versity of Massachusetts, Amherst, 2000.
William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling.
Numerical Recipes in C. Cambridge University Press, 1988.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic
Programming. John Wiley & Sons, Inc., New York, NY, 1994.
Balaraman Ravindran and Andrew G. Barto. Model minimization in hierarchical
reinforcement learning. In Fifth Symposium on Abstraction, Reformulation and
Approximation (SARA 2002). Springer Verlag, 2002.
Balaraman Ravindran and Andrew G. Barto. SMDP homomorphisms: An alge-
braic approach to abstraction in semi-Markov decision processes. To appear in
the Proceedings of the Eighteenth International Joint Conference on Artificial
Intelligence (IJCAI 03), 2003.
Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pren-
tice Hall, Upper Saddle River, NJ, 1995.
Malcolm R. K. Ryan. Hierarchical Reinforcement Learning: A Hybrid Approach.
PhD thesis, Computer Science and Engineering, University of New South Wales,
2002.
Malcolm R. K. Ryan and Mark D. Reid. Using ILP to improve planning in hierarchical
reinforcement learning. Proceedings of the Tenth International Conference on
Inductive Logic Programming, 2000.
Claude A Sammut. Learning Concepts by Performing Experiments. PhD thesis,
Department of Computer Science, University of New South Wales, 1981.
Juan C. Santamaria, Richard S. Sutton, and Ashwin Ram. Experiments with rein-
forcement learning in problems with continuous state and action spaces. Adaptive
Behavior, 6(2), 1998.
Alen D. Shapiro. Structured Induction in Expert Systems. Turing Institute Press in
association with Addison-Wesley, Workingham, England, 1987.
Sidney Siegel and N. John Castellan, Jr. Nonparametric Statistics for the Behavioural
Sciences. McGraw-Hill, New York, 1988.
Herbert A. Simon. The Sciences of the Artificial. MIT Press, Cambridge, Mas-
sachusetts, 3rd edition, 1996.
Satinder Singh. Reinforcement learning with a hierarchy of abstract models. In
Proceedings of the Tenth National Conference on Artificial Intelligence, 1992.
Peter Stone. Layered Learning in Multi-Agent Systems. PhD thesis, School of Computer
Science, Carnegie Mellon University, Pittsburgh, PA, December 1998.
David L. Streiner. Maintaining standards: differences between the standard devia-
tion and standard error, and when to use each. Can J Psychiatry, 48(8):498–502,
1996.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.
MIT Press, Cambridge, Massachusetts, 1998.
Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semi-
MDPs: A framework for temporal abstraction in reinforcement learning. Artificial
Intelligence, 112(1-2):181–211, 1999a.
Richard S. Sutton, Satinder Singh, Doina Precup, and Balaraman Ravindran. Im-
proved switching among temporally abstract actions. In M. S. Kearns, S. A. Solla,
and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11.
MIT Press, 1999b.
Georgios Theocharous and Leslie Pack Kaelbling. Approximate planning in
POMDPs with macro-actions. Advances in Neural Information Processing Sys-
tems 16 (NIPS-03), 2004. to appear.
Sebastian Thrun and Anton Schwartz. Finding structure in reinforcement learning.
In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Informa-
tion Processing Systems (NIPS) 7, Cambridge, MA, 1995. MIT Press.
Paul E. Utgoff and David J. Stracuzzi. Many-layered learning. Neural Computation,
2002.
William T. B. Uther. Tree Based Hierarchical Reinforcement Learning. PhD thesis,
Computer Science, Carnegie Mellon University, 2002.
Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s
College, 1989.
Christopher J. C. H. Watkins and Peter Dayan. Technical note: Q-learning. Machine
Learning, 8:279–292, 1992.
Shimon Whiteson and Peter Stone. Concurrent layered learning. The Second Inter-
national Joint Conference on Autonomous Agents and Multiagent Systems, 2003.
to appear.
Appendix A
Mathematical Review
This appendix provides a cursory review of the algebra and statistics referred to in
this thesis.
A.1 Partitions and Equivalence Relations
HEXQ partitions the state space into equivalence classes of regions and constructs
a hierarchy of progressively coarser partitions.
A set G = {g1, . . . , gm} is called a partition of a state space S if S = g1 ∪ · · · ∪ gm and
gi ∩ gj = ∅ for all i ≠ j. Region g ∈ G is also referred to as a block or, in the current
context, an aggregate state. A state s ∈ S is referred to as a base-level state.
A partition G′ is a refinement of partition G if each block of G′ is a subset of
some block of G. Conversely G is said to be coarser than G′. A quotient partition
is a partition of the blocks of a partition. The quotient partition of G with respect
to G′ is a partition of the blocks of G′ where two blocks are identified if and only if
they are contained in the same block of G (Hartmanis and Stearns, 1966).
A binary relation B on set G is a subset of the Cartesian product G × G. If
(gi, gj) ∈ B the binary relation can be written giBgj where gi, gj ∈ G. When B is
reflexive, symmetric and transitive, that is
• giBgi,
• giBgj ⇔ gjBgi and
• giBgj, gjBgk ⇒ giBgk for all gi, gj, gk ∈ G,
then B is an equivalence relation. If B is an equivalence relation on set G then for
gi ∈ G, the equivalence class of gi is the set [gi] = {gj ∈ G|giBgj}.
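As a small illustration (the function below is invented here, not part of HEXQ), the equivalence classes of a relation given as a set of pairs can be computed directly from this definition, and together they form a partition of G:

    def equivalence_classes(G, B):
        # G: list of elements; B: set of pairs (gi, gj) such that gi B gj.
        # Assumes B is reflexive, symmetric and transitive on G.
        classes = []
        for g in G:
            cls = frozenset(h for h in G if (g, h) in B)
            if cls not in classes:
                classes.append(cls)
        return classes        # the blocks of the induced partition of G

    B = {(0, 0), (1, 1), (2, 2), (0, 1), (1, 0)}
    print(equivalence_classes([0, 1, 2], B))   # [frozenset({0, 1}), frozenset({2})]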
A.2 Directed Graphs
A directed graph G is a pair (V, E), where V is a finite set of vertices and E is a set
of directed edges. E is a binary relation on V .
An MDP can be represented as a directed graph in which the states si ∈ S, i =
1, . . . , |S| are the vertices and the state transitions are directed edges. An edge exists
from state s to s′ whenever the probability of transition in a single step from s to
s′ is greater than zero for some action a ∈ A (i.e., T^a_{ss′} > 0).
A directed graph can be decomposed into strongly connected components (SCC)
using two depth-first searches. The linear time algorithm (O(V + E)) shown in
table A.1 is used by HEXQ in Chapter 5 to find Markov regions. It is adapted from
Cormen et al. (1999). It takes as input the adjacency matrix adj[s][s′] signifying that
state s may transition to state s′ where s, s′ ∈ S. It outputs the strongly connected
component label of each state s, i.e. SCC[s] and the number of strongly connected
components in the graph, SCClabel, on return.
Table A.1: Function SCC finds strongly connected components of a directed graph representing transitions between states. An edge from s to s′, s, s′ ∈ S, is represented by adj[s][s′]. SCC[s] is the integer label of the SCC for state s.
function SCC( states S, adj[s][s′] )
1. initialise finTime ← 0
2. initialise SCClabel ← 0
3. for each state s ∈ S
4. initialise col[s] ← WHITE
5. initialise f [s] ← undefined
6. initialise SCC[s] ← undefined
7. for each state s ∈ S if (col[s] = WHITE) DFS1(s)
8. for each state s ∈ S col[s] ← WHITE
9. for each state s ∈ S in order of decreasing f [s]
10. if(col[s] = WHITE) DFS2(s)
11. increment SCClabel
12. return SCC[·], SCClabel
DFS1(s)
13. col[s] ← GRAY
14. increment finTime
15. for each state s′ ∈ S if (adj[s][s′] and col[s′] = WHITE) DFS1(s’)
16. col[s] ← BLACK
17. f [s] ← finTime
18. increment finTime
19. return
DFS2(s)
20. col[s] ← GRAY
21. for each state s′ ∈ S if (adj[s′][s] and col[s′] = WHITE) DFS2(s’)
22. col[s] ← BLACK
23. SCC[s] ← SCClabel
24. return
end SCC
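For reference, the two-pass algorithm of Table A.1 can be rendered as the following runnable Python sketch (an illustration written for this appendix, not the thesis implementation): the first depth-first search records the finishing order of states, and the second search, following edges backwards in order of decreasing finish time, labels the strongly connected components.

    def strongly_connected_components(states, adj):
        """adj[s][t] is true when state s may transition to state t in one step.
        Returns (scc_label, num_sccs) where scc_label[s] identifies the SCC of s."""
        order, visited = [], set()

        def dfs1(s):                       # first pass: record finishing order
            visited.add(s)
            for t in states:
                if adj[s][t] and t not in visited:
                    dfs1(t)
            order.append(s)

        for s in states:
            if s not in visited:
                dfs1(s)

        scc_label, label = {}, 0

        def dfs2(s):                       # second pass: follow edges backwards
            scc_label[s] = label
            for t in states:
                if adj[t][s] and t not in scc_label:
                    dfs2(t)

        for s in reversed(order):          # decreasing finish time
            if s not in scc_label:
                dfs2(s)
                label += 1
        return scc_label, label

    # Two SCCs: {0, 1} (mutually reachable) and {2}.
    adj = [[False, True, False], [True, False, True], [False, False, False]]
    print(strongly_connected_components(range(3), adj))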
A.3 Statistical Tests
In this thesis non-parametric statistical tests are used to assist in the partitioning
of a multi-dimensional state space.
The Kolmogorov-Smirnov (K-S) test is designed to test whether two independent
samples come from the same probability distribution (Siegel and N. John Castellan,
1988). The test is sensitive to the difference between the cumulative probability
distributions for the two samples. If the two samples are drawn from the same
distribution then their cumulative probability distributions can be expected to be
close to each other. A large deviation is evidence for the two samples coming from
separate distributions. The test requires that each set of samples is arranged into a
cumulative frequency distribution. For real values this implies sorting the samples by
the magnitude of their values. The maximum difference between the two cumulative
probability distributions is taken as a guide to estimate whether the two samples
come from the same distribution.
The K-S test will be used to determine whether the primitive rewards for a
transition from state s to s′ given action a are stationary. To estimate this, two
separate samples of rewards for this transition, taken at different time periods, will be
compared to see if they come from the same probability distribution. The algorithm
for the K-S test is given by McCallum (1995) who adapted it from “Numerical
Recipes in C” (Press et al., 1988). This test was successfully used by McCallum
(1995) in deciding to split states in the UTree algorithm based on the significance
of the hypothesis that states in the fringe generated a different value function from
leaf node states.
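As an illustration of the same idea using an off-the-shelf routine (SciPy's two-sample K-S test, rather than the Press et al. code used in the thesis), two windows of rewards observed for the same transition at different times can be compared, with a small p-value taken as evidence of non-stationarity; the numbers below are made up:

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    early_rewards = rng.normal(loc=-1.0, scale=0.5, size=200)  # first time period
    late_rewards = rng.normal(loc=-2.0, scale=0.5, size=200)   # later time period

    statistic, p_value = ks_2samp(early_rewards, late_rewards)
    if p_value < 0.05:
        print("reward for this transition appears non-stationary:", statistic, p_value)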
The Binomial Test is appropriate when a population consists of two classes, say
0 and 1. If the probability of sampling an instance of class 1 is p then the probability
of sampling class 0 is q = 1 − p. For a sample size of N, the probability that there
are at least k instances from class 1 is

$$\Pr[Y \geq k] = \sum_{i=k}^{N} \binom{N}{i} p^{i} q^{N-i} \qquad (A.1)$$

where

$$\binom{N}{k} = \frac{N!}{k!\,(N-k)!}. \qquad (A.2)$$
The value of p may vary from population to population. If the null hypothesis
is that p has a particular value, then it is possible to test whether it is reasonable to
believe that a sample comes from this distribution at a specific level of significance.
The two-tailed test is used to determine whether the probability of transitioning
from state s to s′ given action a, T^a_{ss′}, is stationary. The long term probability, p,
is estimated from the frequencies of transition from s to s′. A sample of transitions
from a particular time period is then tested to see if they could reasonably have
come from this distribution. If it fails, the hypothesis is rejected and the transition
is declared non-stationary.
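A sketch of this check with SciPy's binomial test (illustrative only; requires SciPy 1.7 or later, and the numbers are made up) is:

    from scipy.stats import binomtest

    p_long_run = 0.8      # long-term estimate of the transition probability T^a_ss'
    k, N = 52, 100        # transitions observed in a recent window of N trials

    result = binomtest(k, N, p_long_run, alternative='two-sided')
    if result.pvalue < 0.01:
        print("transition probability appears non-stationary")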
Appendix B
Significance Values in Graphs
B.1 Error Bars and Confidence Intervals
[Figure B.1: The 95% confidence interval in the mean performance of the stochastic taxi using a flat reinforcement learner. Axes: average reward per time step against time steps (0 to 100,000).]
For the empirical evaluations in Chapter 6 each of the experiments was averaged
over 100 runs. Error bars have been omitted in the graphs in this Chapter to avoid
clutter. As an indication of the variation in results, the 95% confidence interval is
plotted here for two of the experiments. We are interested in the confidence of the
reported mean that is assumed to be normally distributed.
Figure B.1 shows the 95% confidence interval for the mean of the stochastic taxi
results using a flat reinforcement learner from section 6.1.2. The confidence interval
of the mean was estimated by taking two standard deviations above and below the
mean of 100 sample experiments. Each experiment consisted of 100 runs.
[Figure B.2: The 95% confidence interval (2 times SE) for the stochastic taxi performance for HEXQ excluding construction. Axes: average reward per time step against time steps (0 to 10,000).]
Figure B.2 calculates the variation in the mean using the standard deviation of
the underlying data. It shows the 95% confidence interval using error bars for the
initial time steps of the stochastic taxi performance results for HEXQ excluding
construction from section 6.1.2. The confidence interval was estimated by taking
twice the standard error (Streiner, 1996) above and below the mean over 100 runs.
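For concreteness, the two interval constructions used above can be computed as follows (an illustrative sketch with made-up data):

    import numpy as np

    runs = np.random.default_rng(1).normal(loc=-0.5, scale=0.2, size=100)  # 100 runs

    mean = runs.mean()
    sd = runs.std(ddof=1)               # sample standard deviation
    se = sd / np.sqrt(len(runs))        # standard error of the mean

    interval_sd = (mean - 2 * sd, mean + 2 * sd)   # plus/minus two standard deviations
    interval_se = (mean - 2 * se, mean + 2 * se)   # ~95% confidence interval of the mean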
The analysis above shows that the variance in the expected means over 100 runs
is low enough to make the comparison between the graphs meaningful.
B.2 Taxi Learning Rate and Convergence
The temporal difference learning rates in this thesis have been kept constant, because
it was felt that efficient convergence of sub-MDPs is largely an independent issue
from the automation of hierarchy discovery. This way the task of tuning the learning
rate was avoided.
Watkins and Dayan (1992) proved that reinforcement learning will converge if
all state-action pairs are visited infinitely often and the learning rate β_i for the i-th
update for every state-action pair (s, a) meets the conditions

$$\sum_{i=1}^{\infty} \beta_i = \infty \qquad (B.1)$$

and

$$\sum_{i=1}^{\infty} \beta_i^2 < \infty. \qquad (B.2)$$
With a constant learning rate, it is instructive to understand how the learning
rate value effects the convergence for the stochastic taxi problem to help interpret
the results.
If an agent in state s takes action a, receives a stochastic reward r and moves to
the next state s′, then the Q action value function is updated with the training rule:

$$Q(s, a) \leftarrow (1 - \beta_i)\, Q(s, a) + \beta_i \left[ r + \gamma \max_{a'} Q(s', a') \right] \qquad (B.3)$$
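Written as code, training rule (B.3) with a constant learning rate is simply the following (an illustrative sketch; the state and action names are invented):

    def q_update(Q, s, a, r, s_next, actions, beta=0.25, gamma=0.9):
        # One Q-learning backup with a constant learning rate beta, as in (B.3).
        target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
        Q[(s, a)] = (1 - beta) * Q.get((s, a), 0.0) + beta * target

    Q = {}
    q_update(Q, s='taxi_at_0', a='north', r=-1.0, s_next='taxi_at_5',
             actions=['north', 'south', 'east', 'west', 'pickup', 'putdown'])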
The graph in figure B.3 below gives an appreciation of the variation in perfor-
mance for the stochastic taxi task from section 6.1.2 for various fixed settings of the
learning rate.
As the learning rate is decreased, convergence takes longer and performance
improves. This is not surprising. Large learning rates move the estimated Q values
more quickly towards their optimal values but the estimates are then nudged around
by the stochastic nature of the experience, resulting in a poorer performance as some
greedy actions become sub-optimal. Smaller learning rates move estimates more
[Figure B.3: The performance of the converged policy improves with a smaller learning rate parameter β with a flat reinforcement learner in the Taxi domain with stochastic actions. Axes: moving average reward per time step against primitive time steps (0 to 400,000); one curve per learning rate in {0.04, 0.05, 0.1, 0.2, 0.25, 0.3}.]
slowly towards the optimal values, in a less volatile fashion as each new experience
makes only small adjustments. There is hence less risk that Q values drift sufficiently
far to affect the optimal policy.
The learning rate of 0.25 was chosen for its rapid approach to the correct Q
function. The only consequence is that, when results for HEXQ and a flat learner
are compared, one reason they may not appear to converge to exactly the same
function values (as, for example, in figure 6.7) is this effect of fixing the learning
rate.
List of Figures
1.1 A maze showing three rooms interconnected via doorways. Each room
has 25 positions. The aim of the robot is to reach the goal. . . . . . . 4
1.2 The maze showing the cost to reach the goal from any location. . . . 5
1.3 The maze, decomposed into rooms, showing the number of steps to
reach each room exit on the way to the goal. . . . . . . . . . . . . . . 5
1.4 The maze, abstracted to a reduced model with only three states, one
for each room. The arrows indicate transitions for the abstract actions
that are the policies for leaving a room to the North, South, East and
West. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 An agent interacting with its environment and receiving a reward signal. 13
2.2 Episodic MDP showing the transition on termination to a hypothet-
ical absorbing state . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Ashby’s 1952 depiction of a gating mechanism to accumulate adap-
tions for recurrent situations. . . . . . . . . . . . . . . . . . . . . . . 28
3.2 The maze from Chapter 1 reproduced here for convenience. . . . . . . 31
4.1 A simple example showing three rooms interconnected via doorways.
Each room has 25 positions. The aim of the agent is to reach the goal. 47
4.2 For transition (x3, y0) →b (x2, y1) the y variable changes value. As
this is a variant transition, ((x3, y0), b) is an exit. If all states were
in the one region, then entry state (x0, y0) cannot reach exit state
(x3, y1) using non-exit actions. Two regions are therefore necessary
to meet the HEXQ partition requirements. . . . . . . . . . . . . . . . 51
4.3 In this example all states have the same y value. If all states are in the
one region, then entry state (x5, y0) cannot reach exit state (x3, y0).
Therefore, two regions are necessary to meet the HEXQ partition
requirements. The inter-block transition means that ((x3, y0), b) is an
exit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 The transitions (x0, y0) →b (x1, y0) and (x0, y1) →b (x1, y1) have dif-
ferent associated rewards and hence give rise to exits ((x0, y0), b) and
((x0, y1), b) by condition 2 of variant transitions. . . . . . . . . . . . . 52
4.5 HEXQ partitioning of the maze in figure 4.1. The state representation
effects the partitioning. In (a) the agent uses a position in room and
room sensor resulting in three regions (b). In (c) the agent uses a
coordinate like sensor, that partitions the state space into the 15
regions shown in (d). . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6 The HEXQ tree for the simple maze showing the top level semi-MDP
and the 12 sub-MDPs, 4 for each region. The numbers shown for the
sub-MDP are the position-in-room variable values. . . . . . . . . . . . 57
4.7 An example trajectory under policy π, for N = 4 steps, where sub-
MDP m invokes sub-MDP ma using abstract action a, showing the
sum of primitive rewards to the exit state sa and the primitive reward
on exit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.8 The value of state (3, 0) is composed of two HEXQ E values. . . . . . 62
4.9 A region with two exits, where the HEXQ decomposition misses a
potentially low cost exit policy from the region. . . . . . . . . . . . . 64
4.10 For the region in figure 4.9 the optimal policies for the two sub-MDPs
created by HEXQ (one for each exit) are shown in (a) and (b). The
optimal policy to use either exit, shown in (c), is not available to
HEXQ by construction. . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.11 The maze HEXQ graph with sub-MDPs represented compactly . . . . 70
4.12 The simple maze HEXQ graph with top level sub-MDP abstracted. . 72
4.13 The multi-floor maze . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.14 The HEXQ tree of sub-MDPs generated from the multi-floor maze . . 74
4.15 The HEXQ directed acyclic graph of sub-MDPs derived from the
HEXQ tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 The simple maze example introduced previously. The invariant sub-
space regions are the rooms. The lower half shows four sub-MDPs,
one for each possible room exit. The numbers are the position-in-
room variable values. . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 The Markov Equivalent region for the maze example showing the four
possible exits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 The maze example with a corner obstacle and an expensive transition
in room 0 giving rise to non-stationary transitions from the perspec-
tive of the location-in-room variable. . . . . . . . . . . . . . . . . . . 86
5.4 All actions in this example are assumed to have some probability of
transitioning to adjacent states. Part (a) illustrates two such actions
near doorways. Function Regions breaks a room iteratively into single
state MERs. The results of the first three iterations are shown as
parts (b), (c) and (d). . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5 Four SCCs joined to form two MERs . . . . . . . . . . . . . . . . . . 92
5.6 Two exits of a level 2 MER that requires 2 separate level 2 sub-MDPs
even though both exits have the same level 2 exit state, S21 . . . . . . . 94
6.1 The Taxi Domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2 The directed graph of projected state transitions for the taxi location.
Exits are not counted as edges of the graph. . . . . . . . . . . . . . . 107
6.3 The four level 1 sub-MDPs for the taxi domain, one for each hierar-
chical exit state, constructed by HEXQ. . . . . . . . . . . . . . . . . 108
6.4 State transitions for the passenger location variable at level 2 in the
hierarchy. There are 4 exits at level 2.. . . . . . . . . . . . . . . . . . 109
6.5 The top level subMDP for the taxi domain showing the abstract ac-
tions leading to the goal. . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.6 The HEXQ graph showing the hierarchical structure automatically
generated by the HEXQ algorithm. . . . . . . . . . . . . . . . . . 112
6.7 Performance of HEXQ with and without hierarchy construction against
MAXQ and a flat MDP for the stochastic taxi. . . . . . . . . . . . . . 113
6.8 Performance of the stochastic taxi with 4 variables compared to the
three variable representation. . . . . . . . . . . . . . . . . . . . . . . 117
6.9 Taxi domain with four variables, (a) x and y coordinates for the taxi
location, (b) the y variable MER for deterministic actions, (c) the
three MERs for deterministic actions when the x variable is forced to
level 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.10 Performance of the taxi with a fickle passenger compared to the orig-
inal decisive passenger. HEXQ results do not include the time-steps
required for construction of the HEXQ graph. . . . . . . . . . . . . . 120
6.11 The Taxi with Fuel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.12 The MER in the taxi with fuel problem showing the taxi location ×fuel level state space and some example transitions. . . . . . . . . . . 124
6.13 Performance of the taxi with fuel averaged over 100 runs using stochas-
tic actions and a two variable state vector. HEXQ attains optimal
performance after it is switched to hierarchical greedy execution . . . 125
6.14 HEXQ partitioning of an MDP with 25 Rooms. The numbers indicate
the values for the position-in-room variable. . . . . . . . . . . . . . . 128
6.15 The optimal value for each state after HEXQ has discovered the one
way barrier constructed in the room containing the goal. The barrier
was constructed in two separate ways (1) by using a border that
only allows transitions North and (2) a virtual barrier by imposing a
reward of -100 for transitioning South. Both barriers produced similar
value functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.16 The Tower of Hanoi puzzle with seven discs showing (a) the start
state, (b) an intermediate state and (c) the goal state. . . . . . . . . . 131
6.17 The directed graph for level 1 in the decomposition of the Tower of
Hanoi puzzle showing one MER and six exits. . . . . . . . . . . . . . 132
6.18 The directed graph for level 2 in the decomposition of the Tower of
Hanoi puzzle showing one MER and six exits. . . . . . . . . . . . . . 134
6.19 The performance comparison of a flat reinforcement learner and HEXQ
on the Tower of Hanoi puzzle with 7 discs. . . . . . . . . . . . . . . . 135
6.20 The directed graph for level 1 in the decomposition of the stochastic
Tower of Hanoi puzzle showing one MER and 12 exits. . . . . . . . . 137
6.21 A simulated stylised bipedal agent showing its various stances. . . . . 138
6.22 The four stances (left) that comprise a successful traversal of a hexag-
onal cell (right). Each of the six directions has 4 associate positions
across the cell. One set is illustrated. . . . . . . . . . . . . . . . . . . 139
6.23 The stylized soccer field illustrating the stochastic nature of the soccer
ball. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.24 The deictic representation of the location of the ball relative to the
agent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.25 The HEXQ graph for the ball kicking domain. Level 1 regions learn
to “walk”, level 2 regions learn to kick the ball and the top level learns
to kick goals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.1 For a discounted value function, the amount of discount applied after
exiting the sub-MDP depends on the number of steps required to
reach the exit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2 A simple example showing the state abstracted decomposition of a
discounted value function. (a) shows a 2-dimensional MDP with two
identical regions with one exit to the right. The deterministic actions
are move left and right, all rewards are -1. The reward on termination
is 20. The discount factor is 0.9. (b) the composed value function
for each state. (c) and (d) are the abstracted sub-MDP value and
discount functions respectively. . . . . . . . . . . . . . . . . . . . . . 154
7.3 The infinite horizon taxi task. The graph shows that HEXQ finds and
switches policies similarly to that of a flat reinforcement learner for
various values of positive reward at $. As the reward at $ increases
the taxi stops delivering passengers and instead visits the $ location
as frequently as possible. This provides confirming evidence for the
correct operation of HEXQ for an infinite horizon problem, even when
the optimum solution requires continuing in the non-exit navigation
sub-MDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.4 The soccer player showing (a) the simulated robot leg positions, (b)
400 discrete ball locations on the field, (c) the discounted value of
states in the level 2 no-exit sub-MDP when the robot is rewarded
for running around the ball and (d) a snapshot of the robot running
around the ball. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.1 The plan view of two out of ten floors connected via two banks of
elevators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.2 The simple room example with wider doorways. . . . . . . . . . . . . 171
8.3 Kaelbling’s 10×10 navigation maze. The regions represent a Voronoi
partition given the circled landmarks. . . . . . . . . . . . . . . . . . . 174
8.4 Performance of HEXQ on stochastic version of Kaelbling’s maze.
Steps per trial are collected in buckets of 10 samples over 10 runs.
The trend lines are 255 point moving averages. . . . . . . . . . . . . . 176
9.1 A simple navigation problem to leave a room. The agent uses one
step moves in 4 compass directions to reach the goal state from any
starting position. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.2 New hierarchical structure for the simple navigation problem using
multiple partitions at level 1. . . . . . . . . . . . . . . . . . . . . . . . 183
9.3 A partially observable environment with two independent tasks and
2-dimensional actions. . . . . . . . . . . . . . . . . . . . . . . . . . . 187
9.4 The expected result of an automatic decomposition of the problem in
figure 9.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
9.5 The Sony AIBO robot showing an image from its colour camera. . . . 195
B.1 The 95% confidence interval in the mean performance of the
stochastic taxi using a flat reinforcement learner. . . . . . . . . . . . 216
B.2 The 95% confidence interval (2 times SE) for the stochastic taxi per-
formance for HEXQ excluding construction. . . . . . . . . . . . . . . 217
B.3 The performance of the converged policy improves with a smaller
learning rate parameter β with a flat reinforcement learner in Taxi
domain with stochastic actions. . . . . . . . . . . . . . . . . . . . . . 219
List of Tables
2.1 Action-Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Q Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1 The HEXQ algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2 Frequency of change for the rooms example variables over 2000 ran-
dom steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3 Function Regions finds all the Markov Equivalent Regions (MERs)
at level e given a directed graph for a state space Se, Exits(Se) and
Entries(Se) sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 Procedure for evaluating the optimal value of a hierarchical state
in a HEXQ graph. The function returns the optimal value of the
hierarchical state s based on the optimal policy for each sub-MDP
and finds the best greedy action at every level up to e . . . . . . . . 96
5.5 Function Execute solves sub-MDP m associated with abstract action
a at level e. The state s is the current hierarchical state at each level
depending on context in which it is used. lse is the last projected
state at level e. The learning rate β is set to a constant value. The
E tables are originally initialised to 0. . . . . . . . . . . . . . . . . . . 99
6.1 The HEXQ algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Frequency of change for taxi domain variables over 2000 random steps.105
6.3 Storage requirements in terms of the number of table entries for the
value function for the flat MDP, MAXQ, HEXQ, HEXQ with stochas-
tic actions after eliminating no-effect actions and HEXQ with deter-
ministic actions after eliminating no-effect actions. . . . . . . . . . . . 115
6.4 The number of E action-value table entries required to represent the
decomposed value function for the soccer player compared to the the-
oretical number of Q values required for a flat learner. . . . . . . . . . 143
A.1 Function SCC finds strongly connected components of a directed
graph representing transitions between states. An edge from s to
s’, s, s′ ∈ S, is represented by adj[s][s′]. SCC[s] is the integer label
of the SCC for state s. . . . . . . . . . . . . . . . . . . . . . . . . . . 213