Discovering Hierarchy in
Reinforcement Learning
Bernhard Hengst
PhD Thesis
School of Computer Science and Engineering
University of New South Wales
Australia
December 2003
© 2003 Bernhard Hengst
Declaration
I hereby declare that this submission is my own work and to the best
of my knowledge it contains no material previously published or writ-
ten by another person, nor material which to a substantial extent has
been accepted for the award of any other degree or diploma at UNSW
or any other educational institution, except where due acknowledgement
is made in the thesis. Any contribution made to the research by others,
with whom I have worked at UNSW or elsewhere, is explicitly acknowl-
edged in the thesis.
I also declare that the intellectual content of this thesis is the product
of my own work, except to the extent that assistance from others in
the project's design and conception or in style, presentation and
linguistic expression is acknowledged.
Abstract
This thesis addresses the open problem of automatically discovering hierarchical
structure in reinforcement learning.
Current algorithms for reinforcement learning fail to scale as problems become
more complex. Many complex environments empirically exhibit hierarchy and can
be modelled as interrelated subsystems, each in turn with hierarchic structure. Sub-
systems are often repetitive in time and space, meaning that they reoccur as com-
ponents of different tasks or occur multiple times in different circumstances in the
environment. A learning agent may sometimes scale to larger problems if it success-
fully exploits this repetition. Evidence suggests that a bottom-up approach, which
repetitively finds building-blocks at one level of abstraction and uses them as back-
ground knowledge at the next level of abstraction, makes learning in many complex
environments tractable.
An algorithm, called HEXQ, is described that automatically decomposes and
solves a multi-dimensional Markov decision problem (MDP) by constructing a multi-
level hierarchy of interlinked subtasks without being given the model beforehand.
The effectiveness and efficiency of the HEXQ decomposition depends largely on the
choice of representation in terms of the variables, their temporal relationship and
whether the problem exhibits a type of constrained stochasticity.
The algorithm is first developed for stochastic shortest path problems and then
extended to infinite horizon problems. The operation of the algorithm is demon-
strated using a number of examples including a taxi domain, various navigation
tasks, the Towers of Hanoi and a larger sporting problem.
The main contributions of the thesis are the automation of (1) decomposition,
(2) sub-goal identification, and (3) discovery of hierarchical structure for MDPs with
states described by a number of variables or features. It points the way to further
scaling opportunities that encompass approximations, partial observability, selective
perception, relational representations and planning. The longer term research aim
is to train rather than program intelligent agents.
Acknowledgements
I would like to thank Professor Claude Sammut, my supervisor, for his support
over the years. He identified “scaling” as an important issue facing reinforcement
learning, which directly led to this thesis topic. My understanding of research has
benefited from our numerous conversations and his critical insights. He has sup-
ported my attendance at ICML conferences and my participation in the RoboCup
Sony legged league for four years.
I have enjoyed tutoring the Introduction to AI course for two years and guest
lecturing in Machine Learning. I thank my co-supervisor Achim Hoffmann, Claude
Sammut and Mike Bain for these opportunities.
Donald Michie was a long term visitor at the school on several occasions. I
thank him for his initial encouragement and later for the stimulating discussions on
structured induction over extended lunches and via long e-mails.
I have found the many researchers in machine learning, that I have had reason
to contact personally, to always be responsive and willing to offer assistance, despite
their onerous time commitments. They include Chuck Anderson, David Andre,
Andy Barto, Tom Dietterich, Milos Hauskrecht, Richard Korf, Sridhar Mahade-
van, Shie Mannor, Tom Mitchell, Andrew Moore, Ron Parr, Balaraman Ravindran
(Ravi), Sebastian Thrun, Paul Utgoff, Chris Watkins and Rich Sutton.
I recall “discovering” reinforcement learning in 1998 and purchasing Rich Sutton
and Andy Barto’s introductory book on the subject. Rich Sutton did not want the
answers to the exercises widely distributed, but promised to email them to me, a
chapter at a time, if I first sent him my attempted answers. This discipline helped
develop an understanding of the subject which served me well in the intervening
years.
I value the friends and associations made with other research students, past and
present, in the department of artificial intelligence: Rex Kwok, Michael Harries,
Waleed Kadous, Mark Peters, Jane Brennan, JP Bekmann, Phil Preston, Barry
Drake, Duncan Potts, Andy Isaacs, Malcolm Ryan, Mark Reid, Cameron Stone,
James Westendorp and Solly Brown.
On our first meeting, Barry Drake had to explain the meaning of an aliased
state to me. I thank Barry for his assistance, the long philosophic discussion over
coffee, the organisation of the special interest group meetings on spatial topics (SIG
Spatial) and the many games of chess.
Duncan Potts paid me the compliment of reproducing the HEXQ algorithm and
results, based only on the cryptic description in the 2002 ICML paper. I discovered
this one day on his web site, quite by accident. He has since made his own contri-
bution to HEXQ, speeding up the algorithm. I thank him for the subsequent useful
discussions.
I am grateful to Phil Preston for his early and continuing help in the lab with
everything from manufacturing video cables to directing me to security facilities. He
has always been willing to give of his time.
The School’s support staff deserve special recognition. Without them the many
details of room bookings, shipments, travel and function arrangements would not
have proceeded as smoothly. In particular I would like to thank Les Sharpley,
Mariann Davies, Tanya Oshuiko, Sue Lewis, Ann Baker, Magda Chambers and
Brad Hall.
My participation in the RoboCup Sony legged league soccer competition has
been a significant research motivator. It was made all the more rewarding by the
many associations with the staff and students involved in this project over the years.
They include: Son Pham, Darren Ibbotson, John Dalgliesh, Mike Lawther, Tak
Fai Yik, Martin Siu, Spencer Chen, Tom Vogelgesang, Ken Nguyen, Hao Nguyen,
Andres Olave, David Wang, James Wong, Nik Von Huben, James Brooks, Tim
Tam, Min Sub Kim, Alan Tay, Benjamin Leung, Albert Chang, Ricky Chen, Eric
Chung, Ross Edwards, Eileen Mak, Raymond Sheh, Nic Sutanto, Terry Tam, Alex
Tang, Nathan Wong, Brad Hall, Claude Sammut, Alan Blair, Maurice Pagnucco,
Will Uther and Tatjana Zrimec. This league would not of course have been possible
without Sony. Masahiro Fujita, Principal Scientist and System Architect and his
staff at the Intelligent Dynamics Laboratory have been most helpful and supportive.
I would like to thank the following people for reading parts or the whole of this
thesis prior to submission and providing me with valuable feedback: Duncan Potts,
Eduardo Morales, Barry Drake, Achim Hoffmann, Waleed Kadous, Coral Hengst,
Alan Blair, Solly Brown and Claude Sammut. If the thesis is any more readable it
is due to their helpful suggestions.
A special thanks to my son, Kyle Hengst, who animated the optimum behaviour
policy found for the ball kicking domain. I have exploited the visual impact of these
animations on a number of occasions.
Finally, I would like to thank my sons Kyle and Shane and particularly my wife,
Coral, for their love and support during this midlife change in career direction.
This thesis is dedicated to the memory of my grandfather,
Joseph Zenker.
Contents
1 Introduction 1
1.1 Scope and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 HEXQ Automatic Decomposition . . . . . . . . . . . . . . . . . . . . 7
1.4 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Preliminaries 13
2.1 The Markov Property . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Markov Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Semi-Markov Decision Problems . . . . . . . . . . . . . . . . . . . . . 19
3 Background 22
3.1 Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Hierarchical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Learning Hierarchical Models in Stages . . . . . . . . . . . . . . . . . 25
3.4 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . 27
3.4.1 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 HAMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.3 MAXQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.4 Optimality of Hierarchical Structures . . . . . . . . . . . . . . 36
3.5 Learning Hierarchies: The Open Question . . . . . . . . . . . . . . . 38
3.5.1 Bottleneck and Landmark States . . . . . . . . . . . . . . . . 39
3.5.2 Common Sub-spaces . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.3 Multi-dimensional States . . . . . . . . . . . . . . . . . . . . . 40
3.5.4 Markov Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.5 Other Approaches to Discovering Hierarchy . . . . . . . . . . 43
3.6 Conclusions and Motivation for HEXQ . . . . . . . . . . . . . . . . . 44
4 HEXQ Decomposition 46
4.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 HEXQ Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Partitioning MDPs of Dimension Two . . . . . . . . . . . . . 49
4.2.2 Sub-MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.3 Top level semi-MDP . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.4 Higher Dimensional MDPs . . . . . . . . . . . . . . . . . . . . 58
4.2.5 Value Function Decomposition with HEXQ Trees . . . . . . . 58
4.3 Optimality of HEXQ Trees . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.1 Globally Optimal HEXQ . . . . . . . . . . . . . . . . . . . . . 66
4.4 Representing HEXQ trees compactly . . . . . . . . . . . . . . . . . . 69
4.4.1 Markov Equivalent Regions (MERs) . . . . . . . . . . . . . . 69
4.4.2 State Abstracting Markov Equivalent Regions . . . . . . . . . 71
4.4.3 Compaction of Higher Dimensional MDPs . . . . . . . . . . . 72
5 Automatic Decomposition: The HEXQ Algorithm 76
5.1 Variable ordering heuristics . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Finding Markov Equivalent Regions . . . . . . . . . . . . . . . . . . . 82
5.2.1 Discovering Exits . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.2 Forming Regions . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3 Creating and Solving Region Sub-MDPs . . . . . . . . . . . . . . . . 92
5.4 Hierarchical State Value . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.5 State and Action Abstraction . . . . . . . . . . . . . . . . . . . . . . 95
5.6 HEXQ Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.7 Efficiency Improvements . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.7.1 No-effect Actions . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.7.2 Combining Levels . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6 Empirical Evaluation 102
6.1 The Taxi Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.1.1 Automatic Decomposition of the Taxi Domain . . . . . . . . . 104
6.1.2 Taxi Performance . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.1.3 Taxi with Four State Variables . . . . . . . . . . . . . . . . . 116
6.1.4 Fickle Taxi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.5 Taxi with Fuel . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2 Twenty Five Rooms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.3 Towers of Hanoi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.4 Ball Kicking - a larger MDP . . . . . . . . . . . . . . . . . . . . . . . 137
6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7 Decomposing Infinite Horizon MDPs 146
7.1 The Decomposed Discount Function . . . . . . . . . . . . . . . . . . . 149
7.2 Infinite Horizon MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.3 Infinite Horizon Experiments . . . . . . . . . . . . . . . . . . . . . . . 156
7.3.1 Continuing Taxi . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.3.2 Ball Kicking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8 Approximations 163
8.1 Variable Resolution Approximations . . . . . . . . . . . . . . . . . . . 164
8.1.1 Hierarchies of Abstract Models . . . . . . . . . . . . . . . . . 165
8.1.2 Variable Resolution Value Function . . . . . . . . . . . . . . . 167
8.1.3 Variable Resolution Exit States . . . . . . . . . . . . . . . . . 169
8.1.4 Variable Resolution Exits . . . . . . . . . . . . . . . . . . . . 170
8.2 Stochastic Approximations . . . . . . . . . . . . . . . . . . . . . . . . 173
8.3 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 177
9 Future Research 178
9.1 Discovering Exits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.2 Stochasticity at Region Boundaries . . . . . . . . . . . . . . . . . . . 180
9.3 Multiple Simultaneous Regions . . . . . . . . . . . . . . . . . . . . . 181
9.4 Multi-dimensional Actions . . . . . . . . . . . . . . . . . . . . . . . . 183
9.5 Default Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
9.6 Dynamic Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
9.7 Selective Perception and Hidden State . . . . . . . . . . . . . . . . . 186
9.8 Deictic Representations . . . . . . . . . . . . . . . . . . . . . . . . . . 189
9.9 Quantitative Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9.10 Average Reward HEXQ . . . . . . . . . . . . . . . . . . . . . . . . . 192
9.11 Deliberation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
9.12 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.13 Scaling to Greater Heights . . . . . . . . . . . . . . . . . . . . . . . . 194
9.14 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10 Conclusion 197
10.1 Summary of Main Ideas . . . . . . . . . . . . . . . . . . . . . . . . . 198
10.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . 199
10.3 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
BIBLIOGRAPHY 201
APPENDICES 210
A Mathematical Review 211
A.1 Partitions and Equivalence Relations . . . . . . . . . . . . . . . . . . 211
A.2 Directed Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
A.3 Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
B Significance Values in Graphs 216
B.1 Error Bars and Confidence Intervals . . . . . . . . . . . . . . . . . . . 216
B.2 Taxi Learning Rate and Convergence . . . . . . . . . . . . . . . . . . 218
List of Figures 220
List of Tables 227
Chapter 1
Introduction
The objective of this thesis is to study an approach to the discovery of hierarchical
structure in reinforcement learning. The key idea is to automatically find invari-
ant reusable sub-tasks and abstract them to form a reduced model of the original
problem that requires significantly less space and time for its solution.
Reinforcement learning scales poorly because the state space description of a
problem grows exponentially with the number of variables. Bellman (1961) referred
to this as the “curse of dimensionality” adding that sheer enumeration will not solve
problems of any significance.
Many large problems have some structure that allows them to be broken down
into sub-problems and represented more compactly. The sub-problems, being smaller,
are often solved more easily. The solutions to the sub-problems may be combined to
provide the solution for the original larger problem. This decomposition may make
finding the final solution significantly more efficient. Designers usually decompose
problems manually, however, automating the decomposition of a problem appears to
be more difficult. It is desirable to have machines with this ability to free designers
from this task and to allow the machines to adapt to new and unforeseen situations.
1.1 Scope and Assumptions
The scope of this thesis is limited to a particular heuristic algorithm for the auto-
matic hierarchical decomposition and solution of finite Markov decision problems
(MDPs)1. The algorithm is called HEXQ.
The states of the MDP are assumed to be multi-dimensional, meaning, that they
are defined by a number of variables that describe the features of the problem. A
robot’s location, for example, may be given by two variables, the room it occupies
and its position in the room. The variables can be interpreted as a robot’s sensor
readings that provide information about its environment.
A model of the environment describes the probabilistic state transition and re-
ward received by the agent after taking an action. It is assumed that the learning
agent is not in possession of a model of its environment at the beginning, but has
to discover this for itself during learning.
The decomposition heuristic employed by HEXQ proceeds on a variable by vari-
able basis. For many problems it is possible for an agent to learn a partial model
over a subset of the variables that is invariant in all contexts represented by the
rest of the variables. Reusable policies may be learnt over the regions of the state
space defined by the partial model. The original problem is reduced and solved by
factoring out the region details. For example, a robot may learn to navigate inside
rooms and out of doorways. The detailed skills of intra-room navigation can be en-
capsulated as abstract actions and whole rooms considered as the states of a smaller
abstract MDP. In this smaller problem the focus is on learning the best policy to
traverse rooms, to say leave a building. If, in addition, the detailed room leaving
skills can be reused, the overall savings in learning time and policy storage can be
significant.
No assumption needs to be made about the interdependence of the variables
1An MDP is formally described in Chapter 2.
describing the MDP, but an MDP will only decompose in a useful way, given some
constraints in the problem in addition to the Markov property2. An informal char-
acterisation of the constraints required for a useful HEXQ decomposition is that
• the state space is described by a set of variables
• some variables must change on a slower time scale than others,
• the more frequently changing variables should retain the same values to rep-
resent similar features in the context of the slower changing variables, and
• policies can be learnt over the more frequently changing variables to control
the way sub-space regions can be traversed.
The designer is required to ensure that the problem is specified as a Markov
decision problem. The designer is not required to know how the problem can de-
compose, if at all. However, the definition of the variables can make a significant
difference to the efficient decomposition of a problem. If a designer is in a posi-
tion to influence the choice of variables, then an understanding of the operation of
HEXQ may allow a more judicious selection of these variables to obtain a better
decomposition. Section 4.2.1, for example, illustrates different decompositions when
a robot’s position is described using a coordinate system and a local representation.
The quality of the HEXQ decomposition, and HEXQ’s ability to generalise, will
depend on whether any structure in the MDP has been captured by the variables
in some meaningful way.
1.2 A Simple Example
A simple example will help make the learning problem concrete and illustrate the
basic concepts. This example has been chosen because the decomposition is easy to
2The Markov property is defined in Chapter 2.
visualise3.
Figure 1.1: A maze showing three rooms interconnected via doorways. Each room has 25 positions. The aim of the robot is to reach the goal.
Figure 1.1 shows a maze with three rooms interconnected via doorways. Each
room has 25 positions labelled in the same manner for each room. The robot’s
objective is to learn how to quickly leave the maze via either of the two exits to the
goal. The robot has two sensors. One to tell it which room it is in and one that
measures its position in a room. The robot is started at a random location in the
maze. The actions available are to hop deterministically to an adjacent cell in one
of the four compass directions: North, South East or West. If the robot hops into
a barrier it remains where it is, shaken but uninjured. Every action costs one unit,
except the hop to the goal, which is free. On each hop, the robot’s sensors provide
its location inside the maze, that is, state=(room, position-in-room).
A reinforcement learner can explore this problem and learn an optimal value
function (distance to the goal) by backing up the values from each goal, adding one
unit of cost for each additional step to reach the goal. The resulting optimal value
function is shown in figure 1.2. To reach the goal in the quickest way from any
location, it is a simple matter of continually hopping to the next lowest value cell.
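
As an aside, the backup computation just described is easy to express in code. The sketch below is illustrative only: the states, the successor function and the goal set are placeholders for whatever encoding of the maze is chosen, not a representation assumed elsewhere in this thesis. It propagates cost one step at a time outward from the goals, exactly the backup that produces the values in figure 1.2.

from collections import deque

def cost_to_go(states, successors, goals):
    # Backward breadth-first backup for a deterministic, unit-cost problem:
    # the optimal value of a state is its step distance to the nearest goal.
    predecessors = {s: set() for s in states}
    for s in states:
        for s2 in successors(s):
            predecessors[s2].add(s)
    value = {g: 0 for g in goals}              # reaching the goal costs nothing further
    frontier = deque(goals)
    while frontier:
        s = frontier.popleft()
        for p in predecessors[s]:
            if p not in value:                 # first visit gives the shortest distance
                value[p] = value[s] + 1        # one unit of cost per hop
                frontier.append(p)
    return value

Acting greedily with respect to the returned table, that is, always hopping to a neighbouring cell of lower value, reproduces the behaviour described above.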
The three rooms have identical internal transition and reward functions, yet a
3The very simplicity of this example will not allow it to be used to highlight the scaling potential of the approach.
Figure 1.2: The maze showing the cost to reach the goal from any location.
reinforcement learner explores each room without reference to any of the others.
To decompose the problem automatically the robot could explore the behaviour
of each sensor variable separately. It can determine that many transitions for the
position-in-room variable are able to be modelled reliably without reference to the
room value. By focussing on the position-in-room variable, the only positions from
which results are unpredictable are where the robot exits a room. The robot there-
fore considers the sub-problem of how to leave each room via one of its exits.
Figure 1.3: The maze, decomposed into rooms, showing the number of steps to reach each room exit on the way to the goal.
It proceeds to learn separate value functions over just one room, in the same
way the value function was found previously for the whole maze. The room value
functions can be reused. For example, exiting the bottom left room to the East
produces the same room values for each room position as exiting the top left room
to the east. Similarly, exiting the top right room to the South produces the same
values as exiting the top left room to the South. The position values for each of
these room exits are shown in figure 1.3.
Figure 1.4: The maze, abstracted to a reduced model with only three states, one for each room. The arrows indicate transitions for the abstract actions that are the policies for leaving a room to the North, South, East and West.
Having learnt room leaving policies, the original problem can be abstracted and
reduced to a simpler model with just three states, one for each room, as illustrated
in figure 1.4. The key property that makes this abstraction possible is that the value
to reach the goal after exiting each room is independent from the value inside each
room to reach the room exit.
For the robot to decide the best way to act in any location in the maze it must
compose the total value function by adding the cost components of the journey back
from the goal. For example, if it started in the very top left location there are two
different inter-room routes to reach the goal. In each case, the cost to reach the goal
after exiting the top left room is 4. Adding 1 unit of cost to hop into either of the
next two rooms and another 6 to reach a top left room exit, makes the total cost
11, just as it was previously. If instead the starting location is in the top right hand
corner of the top left room, there are still two direct paths to the goal via different
rooms. By adding up the costs this time, one will take 11 units and the other only
7 units. In this way the robot can choose the shortest path, in this case, via the top
right room.
This example has demonstrated how policies and regions for sub-problems can
be abstracted and reused to solve the original problem. How can the decomposition
and solution process be automated?
1.3 HEXQ Automatic Decomposition
This section provides a broad overview of the HEXQ algorithm. The details will
become clearer in Chapters 4 and 5.
HEXQ approaches the automatic decomposition of MDPs by a variable-wise
search for Markov sub-space regions. The decomposition depends largely on the
choice of variables and the structure of the problem. Since variables that are asso-
ciated with the lower levels in a control hierarchy tend to change more frequently,
the variables are first sorted by their frequency of change using a purely random
exploration policy. During a random walk in the maze example, the room variable
will change less frequently than the position-in-room variable.
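
As a rough illustration of this ordering heuristic, and not the implementation used by HEXQ, the following sketch counts how often each state variable changes value along a trajectory gathered under a random exploration policy; the example trajectory and resulting order are hypothetical.

def change_frequencies(trajectory):
    # Count, for each state variable, how often its value changes between
    # consecutive observations of a random-walk trajectory of state tuples.
    num_vars = len(trajectory[0])
    changes = [0] * num_vars
    for prev, curr in zip(trajectory, trajectory[1:]):
        for i in range(num_vars):
            if prev[i] != curr[i]:
                changes[i] += 1
    return changes

# Hypothetical observations of (room, position-in-room) during a random walk.
trajectory = [(0, 12), (0, 13), (0, 14), (1, 10), (1, 11), (1, 12)]
frequencies = change_frequencies(trajectory)
# Order the variables from most to least frequently changing; HEXQ builds
# its hierarchy from the most frequently changing variable upwards.
order = sorted(range(len(frequencies)), key=lambda i: frequencies[i], reverse=True)
print(order)    # [1, 0]: position-in-room changes more often than room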
HEXQ first explores the behaviour of the most frequently changing variable. It
looks for state transitions for this variable (and rewards) that are probabilistically
predictable without any of the other variables changing value. The states of the
variable (its values) may be partitioned into regions that have internal Markov
transition and reward properties invariant of the values of the other variables. Such
regions in the maze in figure 1.1 are the rooms.
HEXQ flags all transitions where either other variables change value or the tran-
sitions (or rewards) are unpredictable4. These transitions are called exits, because,
when executed, the agent leaves a region5. It is a condition of a HEXQ decom-
position that all exits must be reachable from inside a region with probability one
without being forced to leave the region elsewhere. In the maze example, the exits
are the transitions through the doorways.
4 i.e. non-stationary.
5 A region exit may leave a region and transition back to the same region.
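
A loose sketch of this exit-flagging idea follows. It is a simplification for illustration rather than the actual HEXQ test (which is statistical and is described in Chapter 5): experience is assumed to arrive as (state, action, next state) tuples, and a (value, action) pair of the most frequently changing variable is flagged as an exit when another variable changes, or when the projected outcome differs between contexts.

from collections import defaultdict

def find_exits(transitions, var=0):
    # transitions: list of (state, action, next_state), states as tuples.
    # outcomes[(value, action)][context] = set of observed next values of `var`.
    outcomes = defaultdict(lambda: defaultdict(set))
    exits = set()
    for s, a, s2 in transitions:
        value = s[var]
        context = s[:var] + s[var + 1:]
        if (s2[:var] + s2[var + 1:]) != context:
            exits.add((value, a))          # another variable changed: leaving the region
        outcomes[(value, a)][context].add(s2[var])
    for (value, a), by_context in outcomes.items():
        # Outcome depends on the context, so the projected transition is unpredictable.
        if len({frozenset(next_vals) for next_vals in by_context.values()}) > 1:
            exits.add((value, a))
    return exits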
HEXQ constructs multiple sub-MDPs for a region, each with the sub-goal of
leaving the region via one of its exits. The different policies to leave each region are
cached.
A new abstracted MDP is formed at the next level. Its abstract states are
generated by taking the Cartesian product of the region identifiers with the values
of the next most frequently changing variable. Its abstract actions are the cached
policies for each region. This MDP is a semi-MDP because the abstract actions
usually operate over an extended period of time until they terminate with a region
exit. For the maze, this abstract MDP is shown in figure 1.4.
The above procedure can be repeated until only one variable remains defining a
top-level sub-MDP. Solving this sub-MDP solves the original MDP.
HEXQ uses a hierarchically decomposed value function. The value function for
any state is composed of the reward accumulated inside each region to reach an exit
sub-goal, plus the value to continue, represented at more abstract levels.
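
Informally, and only anticipating the precise recursive equations developed in Chapter 4, the decomposition can be pictured as splitting the value of a state between the level that models the region and the level above it; the notation below is suggestive rather than exact:

\[
V(s) \;\approx\; \underbrace{V^{e}_{\mathrm{region}}(s)}_{\text{reward inside the region to reach exit } e}
\;+\; \underbrace{V_{\mathrm{above}}(e)}_{\text{value of continuing after the exit}}
\]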
Opportunities to reduce models and compactly represent value functions present
themselves when
• the policies of one region can be reused,
• a whole class of identical regions represents repetitive sub-spaces,
• region states can be abstracted, and
• policy and state abstractions can be employed at multiple levels.
1.4 Contribution
The contributions of this thesis are:
• An automatic decomposition algorithm for reinforcement learning problems.
The decomposition partitions the states into sub-problems that form smaller
reinforcement learning problems and are connected hierarchically to solve the
overall problem.
• A formulation of value function decomposition equations that generalises the Q
function to abstract MDPs. It allows automatic hierarchical credit assignment
as rewards that cannot be explained at lower levels are relegated to be modelled
at higher levels.
• A method for automatic discovery of sub-goals. Sub-goals are the region exit
states. They are generated naturally as part of the process of region discovery.
• A method for the abstraction of similar regions. Regions may be similar be-
cause they represent similar physical objects or the same object, in different
contexts or with different attributes. HEXQ automatically models similar re-
gions as one region class.
• A method for automatic and safe abstraction of both region policies and states.
A HEXQ decomposition guarantees safe abstraction of region states. The
algorithm is robust in the sense that, if the decomposition produces single
state regions, HEXQ effectively defaults to solving the “flat” problem.
• Proof of globally optimal solutions for HEXQ decomposed deterministic short-
est path problems and similar problems where only the top level sub-MDP is
stochastic.
• A method for safe state abstraction using a decomposed discounted value
function. This extends HEXQ to be able to tackle any finite MDP using a
discounted value function. The main innovation here is the introduction of
an additional and separate decomposed discount function working in concert
with the decomposed value function to safely and compactly represent state
values in a discounted setting.
• HEXQ extensions for automatic hierarchical decomposition of infinite horizon
problems (problems that do not terminate), where good solutions may require
the agent to continue execution in a sub-task.
• An introduction to variable resolution and stochastic approximation tech-
niques over and above safe state abstraction that reduce the computational
complexity further with a controllable trade-off in solution quality. The control
over decision time complexity may provide an anytime execution capability.
• Outline of other future research directions including the construction of more
general hierarchies and training instead of programming agents.
These results make a contribution to the open problem of discovering hierarchical
structure in reinforcement learning. The benefit is that the agent may decompose its
environment based on its own experience, thus relieving a designer from performing
this task. In the best case, hierarchical decomposition leads to space complexity
linear in the number of variables used to describe the sensor state. Empirical eval-
uation testifies to the versatility of HEXQ. In one case a problem is easily solved in
seconds that would otherwise require billions of table entries and is intractable on
present day computers.
1.5 Outline of thesis
Chapter 2 reviews some Markov and semi-Markov decision processes formalisms
that are used in other chapters. This Chapter is supplemented with other
mathematical concepts and algorithms in Appendix A. These concepts are
considered basic and are available through a number of sources. The chapter
can be skimmed for notation and basic assumptions that will help the reader
understand subsequent sections of the thesis.
Chapter 3 reviews background literature, examining:
• the benefit of a hierarchical approach to tackling complex problems,
• hierarchical reinforcement learning with manual decompositions,
• a number of approaches that automate decomposition and discover hier-
archical structure.
Chapter 4 focuses on the theory underlying the HEXQ algorithm. Importantly, it
formally defines the partition conditions used by HEXQ to decompose a multi-
dimensional MDP. The decomposition is explained using the simple maze ex-
ample introduced earlier. The chapter considers the issue of optimality and
shows how the decomposed value function can be represented compactly and
losslessly by abstracting both state and actions.
Chapter 5 describes the HEXQ algorithm, implementing in practice the theory
developed in Chapter 4. It explains how HEXQ explores its environment and
builds a reduced multi-level model of the original problem to find good policies.
Chapter 6 evaluates HEXQ empirically on a number of problems, illustrating its
characteristics. The problems include ones that other researchers have decom-
posed manually and some to test HEXQ in larger domains.
Chapter 7 introduces an additional supporting decomposed discount function to
allow safe state abstraction in the face of discounting. Abstraction of dis-
counted value functions allows the HEXQ algorithm to be extended to solve
infinite horizon problems in which the recursively optimal policy may require
a sub-task to persist.
Chapter 8 extends HEXQ by introducing approximations of the hierarchical value
function that further reduce computational complexity.
Chapter 9 addresses some of the limitations of HEXQ and suggests a number of
potentially fruitful research directions. These include improvements to the
existing algorithm and tackling the larger problem of learning in a complex
environment where the agent’s sensor state is large, yet does not fully describe
the environment.
Chapter 2
Preliminaries
This chapter introduces basic notation and definitions for Markov and semi-Markov
decision processes. The material is generally available in introductory texts and is
not meant to be comprehensive. The literature on Markov decision processes and
reinforcement learning is extensive. Introductory texts include Puterman (1994)
and Sutton and Barto (1998).
Figure 2.1: An agent interacting with its environment and receiving a reward signal.
The reinforcement learning framework comprises an agent that perceives the
environment (or domain) through sensors and acts on that environment through
effectors. The environment also produces a reward, a special numerical value that
the agent tries to maximise over time. The agent is taken to interact with its
environment at discrete time steps, t = 0, 1, 2, 3, . . .. At each time step it receives an
input representing some environmental state and a reward. It responds by taking
an action. The agent-environment interaction is depicted in figure 2.1.
2.1 The Markov Property
As an agent takes actions, $a$, and observes the states, $s$, of the environment, the history of states and actions up to time $t$ is $\ldots, s_{t-2}, a_{t-2}, s_{t-1}, a_{t-1}, s_t, a_t$. If the probability of the next state, $s'$, is only dependent on the last state and action then the state description is said to have the Markov property. This can be defined by specifying $\Pr\{s_{t+1} = s' \mid s_t, a_t\}$ for all $s'$, $s_t$ and $a_t$.

Rewards similarly have the Markov property, when the probability of the next reward value depends only on the last state and action: $\Pr\{r_{t+1} = r \mid s_t, a_t\}$.
2.2 Markov Decision Problems
Well established formal representations exist to describe an agent-environment interaction as a Markov decision problem; the notation here largely follows Sutton and Barto (1998). In this thesis, Markov decision problems are defined by a finite number of states $s_t \in S$ and a finite number of actions, $a_t \in A(s_t)$, that are available in each state. At discrete time steps, given action $a_t$, the system transitions from state $s_t$ to state $s_{t+1}$. A bounded real reward, $r_{t+1} \in \mathbb{R}$, is given to the agent at each time step. The state transition and reward functions are both assumed to have the Markov property. Further, they are both assumed to be stationary functions, meaning they are independent of time. Formally:

A discrete time, finite Markov Decision Process (MDP) is a tuple $\langle S, A, T, R, S_0 \rangle$ where

• $S$ is a finite set of states, $s \in S$.

• $A$ is a finite set of actions, $a \in A$, $A = \bigcup_{s \in S} A(s)$.

• $T$ is the one step probability function of transitioning from one state to the next when taking an action.

$$T^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \tag{2.1}$$

This one step transition probability is stationary, meaning that it is independent of time and is written more succinctly as $\Pr\{s' \mid s, a\}$.

• $R$ is a bounded reward function giving the expected reward on transition from state $s$ to the next state $s'$ after taking action $a$.

$$R^a_{ss'} = E\{r_{t+1} \mid s_{t+1} = s', s_t = s, a_t = a\} \tag{2.2}$$

The reward function is assumed to be stationary.

$$R^a_{ss'} = E\{r \mid s, a, s'\} = \sum_{r} r \cdot \Pr\{r \mid s, a, s'\} \tag{2.3}$$

• $S_0$ is the starting state distribution. This means that the MDP is initialized in state $s$ with probability $S_0(s)$.
In general MDPs can have an infinite number of states or actions. This thesis
considers only MDPs that have a finite number of states and actions.
A model in relation to an MDP refers to the state transition and reward functions.
When the state $s \in S$ is described by a vector of $d$ state variables, $s = (s_1, s_2, \ldots, s_i, \ldots, s_d)$, where $s_i$ is the $i$th state variable, the state $s$ is said to have a dimension of $d$. The associated MDP will be referred to as a $d$-dimensional MDP.
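
For a computational reading of the definition above, the tuple can be written down directly as a data structure. The sketch below is a minimal illustration whose names and layout are choices made here rather than notation from the thesis: states are tuples of variable values, and the transition and reward models are tables indexed by state and action.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = Tuple[int, ...]      # a d-dimensional state, e.g. (room, position-in-room)
Action = int

@dataclass
class FiniteMDP:
    states: List[State]
    actions: Dict[State, List[Action]]                         # A(s)
    # transition[(s, a)] maps each next state s2 to Pr{s2 | s, a}
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    # reward[(s, a, s2)] is the expected reward for that transition
    reward: Dict[Tuple[State, Action, State], float]
    start: Dict[State, float] = field(default_factory=dict)    # S0(s)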
An MDP policy, π, is a mapping from states to possible actions at each time step.
When this mapping is probabilistic, the notation π(s, a) is the stationary probability
of taking action a in state s. The notation π(s) refers to the action chosen according
to the probability distribution, that is π(s) = a. A deterministic policy is one such
that for all s, π(s, a) = 1 for exactly one a.
A deterministic policy is not to be confused with deterministic actions that ensure the next state is determined by the action, that is $T^a_{ss'} = 1$ or $0$. A deterministic policy may or may not involve deterministic actions. Deterministic rewards likewise mean that $\Pr\{r \mid s, a, s'\} = 1$ or $0$. Deterministic transitions mean that both actions and rewards are deterministic.
A Markov Decision Problem has an optimality criterion to maximise a value function, usually some measure of future reward. Value functions may be based on average reward, sum of rewards for a fixed number of time-steps, etc. Throughout this thesis the value function is the commonly used sum of (discounted) future rewards. In this case the value function for state $s$ in MDP $m$ with a policy, $\pi$, and a discount rate, $\gamma$, is given by:

$$V^\pi_m(s) = E\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots \mid s_t = s, \pi\} \tag{2.4}$$
Figure 2.2: Episodic MDP showing the transition on termination to a hypothetical absorbing state.
Episodic MDPs are ones that eventually terminate with probability one in con-
trast to infinite horizon MDPs that may not. To unify the value function definition
for both cases an episodic MDP is modelled by assuming it enters a hypothetical
absorbing state on termination. All transitions from the absorbing state lead back
to that state with probability 1 and reward 0 as illustrated in figure 2.2.
For infinite horizon MDPs the discount factor, γ, is in the range, 0 ≤ γ < 1.
For episodic problems γ can also be equal to 1. In this case the value function is
bounded as no further reward can accumulate after entering the absorbing state.
For an MDP $m$ with a fixed policy $\pi$ the value of state $s$ is "backed up" from the possible next states and can be written as the expected value of the next expected reward together with the discounted value of the next state. This is the Bellman equation:

$$V^\pi_m(s) = \sum_{s'} T^{\pi(s)}_{ss'}\left[R^{\pi(s)}_{ss'} + \gamma V^\pi_m(s')\right] \tag{2.5}$$
The optimal value function, $V^*_m$, maximizes the value function for all states $s \in S$ in MDP $m$ with respect to $\pi$. Bellman proved that this is the unique solution to the Bellman optimality equation:

$$V^*_m(s) = \max_a \sum_{s'} T^a_{ss'}\left[R^a_{ss'} + \gamma V^*_m(s')\right] \tag{2.6}$$

This equation is similar to equation 2.5, except this time, instead of finding the value of each state based on a given policy, the objective is to find a policy $\pi$, designated by $*$, that maximises the value of each state.
The action-value function $Q^\pi(s, a)$ is defined as the expected return starting in state $s$, taking action $a$ and following policy $\pi$ thereafter. $Q^\pi$ and the optimal $Q^*$ are defined in terms of $V^\pi$ and $V^*$ respectively:

$$Q^\pi_m(s, a) = E\{r_{t+1} + \gamma V^\pi_m(s_{t+1}) \mid s_t = s, a_t = a\} \tag{2.7}$$

$$Q^*_m(s, a) = E\{r_{t+1} + \gamma V^*_m(s_{t+1}) \mid s_t = s, a_t = a\} \tag{2.8}$$

and their respective Bellman equations are as follows:
Table 2.1: Action-Value Iteration

function ValueIteration( MDP⟨S, A, T, R⟩, γ )
1.  initialise Q(s, a) ← 0
2.  repeat until Δ < small positive number
3.      Δ ← 0
4.      for each state s ∈ S
5.          for each action a ∈ A
6.              q ← Q(s, a)
7.              Q(s, a) ← Σ_{s′} T^a_{ss′} [ R^a_{ss′} + γ V(s′) ]
8.              Δ ← max(Δ, |q − Q(s, a)|)
9.  end repeat
10. return Q(s, a)
end ValueIteration
$$Q^\pi_m(s, a) = \sum_{s'} T^a_{ss'}\left[R^a_{ss'} + \gamma Q^\pi_m(s', \pi(s'))\right] \tag{2.9}$$

$$Q^*_m(s, a) = \sum_{s'} T^a_{ss'}\left[R^a_{ss'} + \gamma \max_{a'} Q^*_m(s', a')\right] \tag{2.10}$$
Dynamic programming is the usual way to solve an MDP when the model (the state transition and reward functions) is known. Table 2.1 shows one of a class of algorithms that returns the optimal action-value function $Q^*(s, a)$. The optimal value function can be calculated as $V^*(s) = \max_a Q^*(s, a)$ and an optimal policy as $\pi^*(s) = \arg\max_a Q^*(s, a)$. This algorithm was adapted from similar algorithms in Sutton and Barto (1998). When the model is not known beforehand, it is not possible to use dynamic programming directly, but there are several other algorithms, such as Q-learning (Watkins and Dayan, 1992), that can both explore the state space and simultaneously learn optimal value functions. The Q-learning algorithm in table 2.2 was adapted from Sutton and Barto (1998).
Table 2.2: Q Learning

function Q-Learning
1.  initialise Q(s, a) ← 0 and set the learning rate β
2.  initialise s ← initial state
3.  repeat until termination
4.      choose action a using exploration policy derived from Q
5.      take action a, observe r, s′
6.      Q(s, a) ← (1 − β) Q(s, a) + β [ r + γ max_{a′} Q(s′, a′) ]
7.      s ← s′
end Q-Learning
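
Table 2.2 translates almost line for line into code. The sketch below is one possible tabular implementation; the environment interface (reset() and step(a) returning the next state, reward and a termination flag) is an assumption of this example rather than anything defined in the thesis, and an ε-greedy rule stands in for the "exploration policy derived from Q".

import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, beta=0.1, gamma=0.99, epsilon=0.1):
    # Tabular Q-learning, following Table 2.2.
    Q = defaultdict(float)                          # Q(s, a), initialised to 0
    for _ in range(episodes):
        s = env.reset()                             # initial state
        done = False
        while not done:                             # repeat until termination
            if random.random() < epsilon:           # exploration policy derived from Q
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s2, r, done = env.step(a)               # take action a, observe r, s'
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] = (1 - beta) * Q[(s, a)] + beta * target
            s = s2
    return Q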
2.3 Semi-Markov Decision Problems
The actions in the abstracted maze (figure 1.4) in Chapter 1 are room leaving poli-
cies. These actions usually take a number of time steps to complete. In general these
temporally extended or abstract actions will be seen to be important in hierarchical
reinforcement learning as more abstract descriptions of problems use actions that
consist of a whole sequence of more primitive actions.
MDPs generalise to semi-MDPs in which actions can persist over a number of time steps; this discrete formulation largely follows Dietterich (2000). Denoting the random variable $N$ to be the number of time steps that an abstract action $a$ takes to complete when it is executed starting in state $s$, the state transition and reward functions for a semi-MDP can be generalized. The joint probability distribution of result state $s'$ reached in $N$ time steps when action $a$ is executed in state $s$ is:

$$T^{Na}_{ss'} = \Pr\{s_{t+N} = s' \mid s_t = s, a_t = a\} \tag{2.11}$$

Similarly, the expected reward when state $s'$ is reached after $N$ time steps taking
action $a$ in state $s$ is:

$$R^{Na}_{ss'} = E\left\{\sum_{n=1}^{N} \gamma^{n-1} r_{t+n} \,\Big|\, s_t = s, a_t = a, s_{t+N} = s'\right\} \tag{2.12}$$
The Bellman equations for the value functions for an arbitrary policy and optimal policies are similar to those for MDPs with the sum taken with respect to $s'$ and $N$ using the joint probability distribution $T$:

$$V^\pi_m(s) = \sum_{s',N} T^{N\pi(s)}_{ss'}\left[R^{N\pi(s)}_{ss'} + \gamma^N V^\pi_m(s')\right] \tag{2.13}$$

$$V^*_m(s) = \max_a \sum_{s',N} T^{Na}_{ss'}\left[R^{Na}_{ss'} + \gamma^N V^*_m(s')\right] \tag{2.14}$$
For episodic semi-MDPs with the discount factor γ set to 1, the joint probability
distribution with respect to N and s′ can be taken just with respect to s′, the
state reached on termination of the abstract action a after any number of steps. In
this case the Bellman equations are similar to the ones for MDPs with the expected
primitive reward replaced with the expected sum of primitive rewards to termination
of the abstract action.
$$T^a_{ss'} = \Pr\{s' \mid s, a\} = \sum_{N=1}^{\infty} T^{Na}_{ss'} \tag{2.15}$$

$$R^a_{ss'} = E\{R^{Na}_{ss'} \mid s, a, s'\} = \sum_{N=1}^{\infty} \frac{T^{Na}_{ss'}\, R^{Na}_{ss'}}{T^a_{ss'}} \tag{2.16}$$

$$V^\pi_m(s) = \sum_{s'} T^{\pi(s)}_{ss'}\left[R^{\pi(s)}_{ss'} + V^\pi_m(s')\right] \tag{2.17}$$

$$V^*_m(s) = \max_a \sum_{s'} T^a_{ss'}\left[R^a_{ss'} + V^*_m(s')\right] \tag{2.18}$$
MDPs and their generalisation to semi-MDPs are the basis for much of the
background and related work covered in the next Chapter and the formalism to
describe HEXQ decomposition in later Chapters.
Chapter 3
Background
This chapter reviews literature to position and motivate the thesis.
Two methods for scaling up reinforcement learning algorithms are function ap-
proximation and hierarchical decomposition. As the name implies, function ap-
proximation is aimed at approximating and thereby compacting a value function.
Hierarchical approaches use structure in the representation to try to compact the
representation and have the potential, in the best case, to reduce the exponen-
tial growth in the size of the state space to linear in the number of variables. These
methods are not mutually exclusive and function approximation may be used within
hierarchical representations. After a brief introduction to function approximation,
the chapter will focus on hierarchical approaches. For a more general review of
reinforcement learning the survey by Kaelbling et al. (1996) is recommended.
3.1 Function Approximation
Function approximation is used to represent a value function compactly. A state
description is mapped to a value of the function. The mapping may be fixed by
the designer or parameterised, with the parameters updated during the learning
process. In the latter case, function approximation can be viewed as supervised
learning (Sutton and Barto, 1998).
There are many examples of function approximation in reinforcement learning.
Some early examples of function approximation include Samuel’s famous checker
player and Michie and Chambers (1968) BOXES algorithm1. For the checker player,
the board state value function may be approximated by a linear combination of
weighted features such as, for example, the size of the disk advantage. Anderson
(1986) used an artificial neural network to approximate the value function for a pole
balancer. Santamaria et al. (1998) experimented with various techniques such as tile
coding, instance and case based methods in continuous state spaces. In their book
Sutton and Barto (1998) include a good introduction to function approximation and
some case studies. While function approximation has been used successfully in many
cases, caution is required as convergence of the value function is not guaranteed
for all generalisations. Defining the class of functions and conditions to ensure
convergence is an active research topic.
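
For the linear case mentioned above, the idea can be stated in a few lines; the feature function and update rule here are a generic sketch rather than any particular system from the literature:

def linear_value(weights, features):
    # Approximate V(s) as a weighted sum of features describing state s.
    return sum(w * f for w, f in zip(weights, features))

def update_weights(weights, features, target, alpha=0.01):
    # One gradient step that moves the approximation towards a target value
    # (for example, a one-step backed-up estimate). Returns the new weights.
    error = target - linear_value(weights, features)
    return [w + alpha * error * f for w, f in zip(weights, features)]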
Model minimisation (Dean and Givan, 1997) can be interpreted as a type of func-
tion approximation. Approximation is a misnomer in this case, as the value function
is compacted without loss of accuracy. The states of an MDP are partitioned into
blocks such that the states in each block behave in the same way. States are re-
quired to have the same probability and reward function for transitions to other
blocks and hence all states in the one block have the same value. Dean and Givan
(1997) use Bayesian networks (Pearl, 1988) to implicitly encode variable dependen-
cies. Model minimisation is similar to the state-space abstraction by Boutilier and
Dearden (1994) and compacts a value function without loss. Ravindran and Barto
(2002) generalise model minimisation to symmetries in MDP models.
Model minimisation is implicitly applied to Markov sub-space regions as one
type of state abstraction in HEXQ, as will become apparent.
1 the origins of which date back to Michie's 1961 "MENACE" (Matchbox Educable Noughts and Crosses Engine) "computer" built out of matchboxes and beads (Michie, 1986).
3.2 Hierarchical Methods
Hierarchical methods rely on decomposing a problem into smaller parts. The so-
lution involves multiple levels or stages of decision making that together solve the
whole problem. To apply a hierarchical approach someone has to decompose the
problem. This task is usually left to the designer of the algorithm, however, many
researchers point to the desirability of learning decompositions automatically.
The rest of this chapter will focus on hierarchical methods and explore three
themes:
Multi-level Learning. There is much support for the idea that learning in multi-
level stages has the capability to overcome the “curse of dimensionality”.
Learning common sub-problems may be leveraged in subsequent learning. If
the solutions to the sub-problems are used to inductively bias the learning at
the next level, the search space can be reduced. Iteration over multiple levels
creates a scaffolding effect with potentially multiplicative effects for scaling
learning algorithms in complex environments.
Hierarchical Reinforcement Learning. Hierarchical reinforcement learning is
commonly structured using a gating mechanism that learns to switch in one of
a number of more detailed policies as abstract actions. Gating mechanisms can
be cascaded to produce multiple levels of control. Recently proposed frame-
works have much in common and all use a semi-MDP formalism to model
abstract actions.
Automatic Decomposition. This section reviews various approaches to auto-
mate the decomposition of problems. The selection of reviews has not been
limited to reinforcement learning. The general thrust is to automate the
discovery of components of hierarchical learners such as identifying Markov
reusable sub-regions and identifying useful sub-goals.
3.3 Learning Hierarchical Models in Stages
Many researchers have come to the conclusion that learning in stages is necessary
to learn effectively in complex environments.
Simon (1996) describes hierarchical systems as ones composed of interrelated
sub-systems, each in turn being hierarchical in structure until the lowest level of
elementary sub-systems is reached. What is elementary is relative and the subject
is broached again in Chapter 8 with variable resolution models. The lowest level of
resolution available to agents is their sensor-effector interface with their environment.
Why hierarchy? Simon (1996) notes that empirically, a large proportion of com-
plex systems seen in nature, exhibit hierarchical structure. On theoretical grounds,
hierarchy allows the complex to evolve from the simple. From a dynamical view-
point, hierarchical systems have the property of near decomposability, simplifying
their behaviour and description.
In his analysis of the evolution of complex systems, Simon comes to the conclu-
sion that, “Complex systems will evolve from simple systems much more rapidly if
there are stable intermediate forms”.
Ashby (1952, 1956) talks about amplifying the regulation of large systems in
stages. For example, a manager looks after mechanics who look after air-conditioners.
Ashby calls these hierarchical control systems “ultra-stable”. The mechanics are
hired and fired based on performance and they in turn replace parts or the whole
machine in case of malfunction or other changes. One feature of this type of dy-
namic hierarchy is that systems at higher levels tend to make decisions on longer
time scales. The advantage is that the regulatory load is reduced for the supervising
system and the sub-system is given time to adapt to any changes. Ashby’s conclu-
sion, “... the provision of a small regulator at the first stage may lead to the final
establishment of a much bigger regulator so that the process shows amplification.”
Among the basic properties of complex adaptive systems, Holland (1995) lists (1)
aggregation and (2) building blocks. Aggregation refers to forming categories or se-
lecting salient features and in a second sense aggregating behaviours of sub-systems.
Building blocks capture repetition in models. They serve to impose regularity in a
complex world. Holland’s conclusion, “We gain significant advantage when we can
reduce the building blocks at one level to interactions and combinations of building
blocks at a lower level.”
In The Society of Mind, Minsky (1985) phrases it this way, “Achieving a goal
by exploiting the abilities of other agencies [...] is the very power of societies. No
higher-level agency could ever achieve a complex goal if it had to be concerned with
every small detail [...]”
In his famous paper on six different representations for the missionaries and
cannibals problem, Amarel (1968) showed foresight into the possibility of making
machine learning easier by discovering regularities and subsequently using them for
formulating new representations.
Clark and Thornton (1997) present a persuasive case for modelling complex
environments (type-2 problems in their language), namely, that it is necessary to
proceed bottom up and solve type-1 problems (tractable as originally coded) as
intermediate representations. In their words, “[...] the underlying trick is always
the same; to maximise the role of achieved representations, and thus minimise the
space of subsequent search”.
Stone (1998) advocates a layered learning paradigm to complex multi-agent sys-
tems in which learning a mapping from an agent’s sensors to effectors is intractable.
The principles advocated include problem decomposition into multi-layers of ab-
straction, learning tasks from the lowest level to the highest in a hierarchy where
the output of learning from one layer feeds into the next layer.
More recently Utgoff and Stracuzzi (2002) point to the compression inherent in
the progression of learning from simple to more complex tasks. They suggest a
building block approach, designed to eliminate replication of knowledge structures.
Agents are seen to advance their knowledge by moving their “frontier of receptivity”
as they acquire new concepts by building on earlier ones from the bottom up. Their
conclusion, “Learning of complex structures can be guided successfully by assum-
ing that local learning methods are limited to simple tasks, and that the resulting
building blocks are available for subsequent learning.”
Other examples that support this view include (1) Sammut’s (1981) Marvin, a
program that first learns simple concepts that are then used to learn the descrip-
tion of more complex concepts; (2) constructive induction - any form of induction
that generates new descriptors not present in the input data (Dietterich and Michal-
ski, 1984) and (3) structured induction (Shapiro, 1987) in which the user guides a
machine to learn sub-concepts that are used to find the solution to more complex
problems.
There is an important idea common to all these examples. It appears that one
way to scale learning algorithms in complex environments is to learn in stages,
starting with simple concepts for repetitive situations. These simple concepts are
used as inductive bias at higher levels, building a multi-layered abstraction hierarchy.
This principle underlies the work on hierarchical reinforcement learning in this thesis.
3.4 Hierarchical Reinforcement Learning
This section reviews hierarchical approaches for manually decomposed problems
to provide the orientation for the later discussion on more automatic methods of
decomposition.
Hierarchical reinforcement learners can be viewed as gating mechanisms that, at
a higher level, learn to switch in appropriate and more reactive behaviours at lower
levels.
Ashby (1952) proposed just such a gating mechanism (see figure 3.1) for an agent
to handle recurrent situations. Even at this time it was envisaged that the switched-
in behaviours can be learnt adaptions to repetitive environmental “disturbances”.
Figure 3.1: Ashby's 1952 depiction of a gating mechanism to accumulate adaptions for recurrent situations.
The behaviours are accumulated and switched in by “essential” variables “working
intermittently at a much slower order of speed”. The gating mechanism and the
lower level behaviours can of course be hard-coded by a designer and do not have
to be learnt. Robots using a subsumption architecture (Brooks, 1990) provide an
example. Ashby (1956) also recognised that there is no reason that the gating
mechanism should stop at two levels and that the principle could be extended to
any number of levels of control.
Watkins (1989) is often referenced for his contribution to reinforcement learning
of the Q action-value function which allows incremental model free learning (without
having to explicitly store the transition or reward functions). In his thesis he also
discusses the possibility of hierarchical control consisting of coupled Markov decision
problems at each level. In his example, the top level, like the navigator of an 18th
century ship, provides a kind of gating mechanism, instructing the helmsman on
which direction to sail. Watkins did not implement a hierarchical control system in
his thesis, but did raise some pertinent observations and issues about hierarchical
reinforcement learning: (1) at higher levels, decisions are made at slower rates, (2)
there is a distinction between global optimality and optimality at each level, (3)
the possibility of learning concurrently at different levels and (4) different reward
structures may be necessary at each level.
Singh (1992) developed a gating mechanism called Hierarchical-DYNA (H-DYNA),
an extension to DYNA (Sutton and Barto, 1998). DYNA is a reinforcement learner
that uses both real and simulated experience after building a model of the reward
and state transition function. H-DYNA first learns elementary tasks such as to
navigate to specific goal locations. Each task is treated as an abstract action at a
higher level of control and is able to be switched in by a gating mechanism. The
gating mechanism itself is a reinforcement learner. For example, the elementary
tasks may be to navigate to two separate positions A and B. Once each elementary
task has been learnt, in order to learn the composite task (go to A first then B),
H-DYNA only needs to learn to switch in the abstract actions navigate-to-A fol-
lowed by navigate-to-B. Treating sub-tasks as abstract actions is a common theme
in hierarchical reinforcement learning.
In their “feudal” reinforcement learning algorithm, Dayan and Hinton (1992)
emphasise another desirable property of hierarchical reinforcement learners - state
abstraction. They call it “information hiding”. It is the idea that decision mod-
els should be constructed at coarser granularities further up the control hierarchy.
Their learner is essentially a multi-level gating mechanism, described as a strict
management hierarchy, where managers have sub-managers that work for them and
bosses that they work for in turn. Reward is administered by each level manager
to the subordinate agent depending on whether the instructions were carried out
successfully, rather than whether the manager succeeded or failed. Feudal learning
initially takes longer to improve its performance and then learns more rapidly than
ordinary learners, as higher level managers take advantage of the acquired skills of
their subordinates in new situations.
In the hierarchical distance to goal (HDG) algorithm, Kaelbling (1993) intro-
duced the important idea of composing the value function from distance components
along the path to a goal. HDG is modelled on navigation by landmarks. The idea is
to learn and store local distances to neighbouring landmarks and distances between
any two locations within each landmark region. Another function is used to store
shortest-distance information between landmarks as it becomes available from local
distance functions. The HDG algorithm aims for the next nearest landmark on the
way to the goal and uses the local distance function to guide its primitive actions. A
higher level controller switches lower level policies to target the next neighbouring
landmark whenever the agent enters the last targeted landmark region. The agent
therefore rarely travels through landmarks but uses them as points to aim for on its
way to the goal. Interruption of the sub-tasks, before their completion, in order to
improve the policy, is an important idea that is later developed by Dietterich (2000)
as hierarchical greedy execution. When the goal state is in its current landmark
region, the agent goes directly to the goal position. Savings are achieved by com-
posing a higher level distance function from local distance functions, to navigate
between landmarks. Moore et al. (1999) have extended the HDG approach with the
“airport-hierarchy” algorithm. This extension is discussed further below.
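To make the composition concrete, the following small Python sketch (purely illustrative; the containers region_of, local_dist and landmark_dist are assumptions, not Kaelbling’s data structures) assembles a distance-to-goal estimate from a local leg, a landmark-level leg and a final local leg within the goal’s region.

    def composed_distance(state, goal, region_of, local_dist, landmark_dist):
        """Estimate the distance from state to goal by composing local and
        landmark-level distance functions.

        region_of[s]            : the landmark whose region contains state s
        local_dist[L][(a, b)]   : learnt distance between locations a and b
                                  within landmark L's region
        landmark_dist[(L1, L2)] : learnt shortest distance between two landmarks
        """
        ls, lg = region_of[state], region_of[goal]
        if ls == lg:
            # The goal lies in the current landmark region: go to it directly.
            return local_dist[lg][(state, goal)]
        # Otherwise compose three legs: state to its landmark, landmark to
        # landmark, and the goal's landmark to the goal position.
        return (local_dist[ls][(state, ls)]
                + landmark_dist[(ls, lg)]
                + local_dist[lg][(lg, goal)])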
Some key features from the above references may be summarised as follows:
• the abstraction of learnt sequences of more primitive actions as a single be-
haviour or sub-skill.
• the use of gating mechanisms at higher levels that learn to switch in sub-skills
to achieve their own ends.
• state abstraction to simplify the model of the environment thereby reducing
the resolution of the state description to coarser levels of granularity.
• decomposition of the value function of the total task into the sum of the
separate value functions of the higher and lower level tasks.
These four elements are inter-related. Higher levels of the control hierarchy
are associated with higher levels of both temporal and state abstraction. From
the gating mechanism’s perspective the lower level behaviours may persist over a
number of time steps before control is returned. When the gating mechanism is itself
implemented as a reinforcement learner it is natural to use semi-Markov decision
theory to model the extended nature of its decisions (abstract actions).
The next three examples will look at more recent developments in hierarchical
reinforcement learning building on the commonalities so far.
3.4.1 Options
Sutton et al. (1999a) use the term option to model abstract actions. An option
is described by three components: the set of states from which the option can be
invoked, the policy followed by the option while it is executing and the probability of
the option terminating in any state. A higher level controller can decide to initiate an
option in any of its invoking states. Once the option is started it follows its policy and
stochastically terminates, whereupon control is returned to the controller, allowing
it to select another option. Options are abstract actions in a semi-Markov decision
process. Sutton et al. (1999a) and McGovern (2002) develop the option formalism,
which extends reinforcement learning from primitive actions to options in a natural
way, enabling optimal option policies to be learnt.
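As a concrete illustration, an option and the semi-MDP Q-learning backup applied when it terminates might be sketched as follows in Python (the class layout and update form are assumptions for exposition, not the cited authors’ code; with the discount set to one the backup reduces to the undiscounted case used elsewhere in this thesis).

    from dataclasses import dataclass
    from typing import Callable, Set

    @dataclass(eq=False)            # identity hashing so options can key a Q table
    class Option:
        initiation_set: Set         # states from which the option may be invoked
        policy: Callable            # maps a state to a primitive action while executing
        termination_prob: Callable  # maps a state to the probability of terminating there

    def smdp_q_update(Q, s, o, reward_sum, s_next, k, options, alpha=0.1, gamma=1.0):
        """One semi-MDP backup after option o, started in state s, ran for k steps,
        accrued reward_sum and terminated in s_next."""
        candidates = [Q.get((s_next, o2), 0.0)
                      for o2 in options if s_next in o2.initiation_set]
        best_next = max(candidates) if candidates else 0.0
        old = Q.get((s, o), 0.0)
        Q[(s, o)] = old + alpha * (reward_sum + gamma ** k * best_next - old)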
Figure 3.2: The maze from Chapter 1 reproduced here for convenience.
The authors show, for example, how options can learn faster proceeding on a
room by room basis, rather than position by position, in a similar rooms environment
to that shown in figure 3.2. When the goal is not in a location that can conveniently
be reached by the given options, it is possible to include primitive actions as
special-case options and still accelerate learning to some extent. For example, with
room leaving options alone, it is not possible to reach a goal in the middle of the
room. Primitive actions are required when the room containing the goal state is
entered. Although the inclusion of primitive actions guarantees convergence to the
globally optimal policy, this may create extra work for the learner. It is not clear
that options alone provide significant advantages (McGovern and Sutton, 1998)
over reinforcement learning acceleration techniques such as eligibility traces (Sutton
et al., 1999a) and prioritised sweeping (Moore and Atkeson, 1993). The answer
may well lie in tailoring options to meet specific sub-goals and giving them greater
priority. Unless options include primitive actions, a globally optimal policy cannot
be assured. With primitive actions the load on higher levels increases and it is
difficult to achieve scaling.
3.4.2 HAMs
Parr (1998) also uses the semi-MDP framework to model abstract actions. He
reformulates an MDP as a Hierarchy of Abstract Machines (HAMs). In the HAM
approach, policies in the overall MDP are constrained by defining a stochastic finite
state controller as a program that produces actions as a function of the agent’s
sensor state. The underlying abstract machines have (1) action states that specify
an action to be taken in the current environment state, (2) call states that can
execute another machine as a subroutine, (3) stop states that return control to the
calling machine and (4) choice states where action choices are learnt to optimise the
value function.
For example, to tackle the room navigation problem in figure 3.2, a machine
could be specified to move the agent in one of four diagonal directions. If each room
corner is defined as a choice state to switch direction, this machine could learn to
reach the goal. In this example the base level states can be reduced from 75 to 12 by
removing the uncontrolled or non-choice states from the original problem, forming a
semi-MDP. Clearly the quality of the solution of the original MDP depends heavily
on the specification of the underlying machine for the HAM. Global optimality
guarantees are not possible.
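The machine structure itself can be pictured with a small, purely illustrative Python sketch (the representation is an assumption for exposition, not Parr’s implementation): each machine state is tagged as an action, call, choice or stop state, and learning applies only at the choice states.

    from enum import Enum, auto

    class MachineState(Enum):
        ACTION = auto()   # emits a primitive action in the current environment state
        CALL = auto()     # invokes another machine as a subroutine
        CHOICE = auto()   # the learner chooses among successor machine states
        STOP = auto()     # returns control to the calling machine

    # A toy fragment of a diagonal-movement machine: move North then East repeatedly
    # until a choice state is reached, where the learnt policy either continues or
    # hands control back to the calling machine.
    move_north_east = [
        {"kind": MachineState.ACTION, "action": "North", "next": 1},
        {"kind": MachineState.ACTION, "action": "East",  "next": 2},
        {"kind": MachineState.CHOICE, "successors": [0, 3]},
        {"kind": MachineState.STOP},
    ]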
On state aggregation Parr (1998) says, “One of the unsatisfying aspects of the
more formally justifiable state aggregation methods is that they fail to capture much
of the intuitive notion of an abstract state”. People reason about objects like rooms
as if they were single states, yet HAMs are unable to achieve this. This issue is
addressed with algorithms like MAXQ (and HEXQ) where, for example, the three
rooms from figure 3.2 are treated as single abstract states.
Andre and Russell (2002) have recently built on this approach with ALisp, a
programming language that effectively extends the HAM approach. They have also
introduced function decomposition and state abstraction along the lines of MAXQ by
extending the decomposition with a three part equation that ensures hierarchically
optimal solutions for reusable sub-tasks.
3.4.3 MAXQ
Dietterich (2000) formalises an approach to hierarchical reinforcement learning,
called MAXQ, which incorporates all the aforementioned elements. With MAXQ
an episodic MDP is manually decomposed into a hierarchical directed acyclic graph
of sub-tasks. This structure of sub-tasks is called a MAXQ graph and has one top
(root) sub-task. Each sub-task is a smaller (semi-)MDP. In decomposing the MDP
the designer specifies the active states and terminal states for each sub-task. Ter-
minal states are typically classed either as goal terminal states or non-goal terminal
states. Using pseudo-reward disincentives for non-goal terminal states, policies are
learnt for each sub-task to encourage them to terminate in goal terminal states. The
actions defined for each sub-task can be primitive actions or other (child) sub-tasks.
Each sub-task can invoke any of its child sub-tasks as abstract actions. When a
task enters a terminal state it, and all its children, abort and return control to the
calling sub-task.
MAXQ has a number of notable features. MAXQ represents the value of each
state in a sub-task as a decomposed sum of completion values. A completion value is
the expected discounted cumulative reward to complete the sub-task after taking the
next (abstract) action. For the maze in figure 3.2, assume that a designer decomposes
the problem with the root sub-task terminating when the goal is reached and with
four child sub-tasks defined with terminations at each room exit to the North, South,
East and West. To calculate the value of the very top left location, for example, the
minimum completion value after the abstract action to leave the room to the East
or South terminates, is 4. The minimum completion value for each lower sub-task
after the next primitive action towards the room exit terminates is 5. Note that
the room leaving abstract actions are the child tasks invoked by the root task. The
completion value for a primitive action is the expected next primitive reward value.
Therefore the total value of the very top left location is also 11, but composed as
1 + 5 + 5.
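The recursion behind the decomposed value can be sketched in a few lines of Python (an illustration under assumed data structures, not Dietterich’s implementation): the value of a state under a sub-task is the value under its best child plus the learnt completion value for that child, bottoming out at primitive actions.

    def maxq_value(task, s, children, C, expected_reward):
        """Decomposed MAXQ value: V(task, s) = max over a of [ V(a, s) + C(task, s, a) ].

        children[task]          : child sub-tasks or primitive actions of task
        C[(task, s, a)]         : learnt completion value of task after doing a in s
        expected_reward[(s, a)] : expected one-step reward of primitive action a in s
        """
        if task not in children:                 # primitive action: no children
            return expected_reward[(s, task)]
        return max(maxq_value(a, s, children, C, expected_reward) + C[(task, s, a)]
                   for a in children[task])

Applied to the example above, the recursion sums the one-step primitive reward and the two completion values, one per level of the hierarchy.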
Another feature of MAXQ is that it allows sub-task policies to be reused in
different contexts. The price to pay for this reuse is that the internal policies are
not attuned to each external situation and may be sub-optimal when viewed from
outside. Dietterich (2000) illustrates this well with a room that has two separated
goal terminal states representing doorways. The optimal policy for this sub-task
causes the agent to exit via the nearest doorway. If in a particular context one
doorway presents a more desirable exit, the MAXQ agent has no way of knowing
this and will not favour the more desirable exit. In this context the room leaving
policy is sub-optimal. The MAXQ solution is predicated on isolating each sub-task
policy in this way and parent sub-tasks use the isolated optimal policies of their child
sub-tasks. Dietterich calls this a recursively optimal solution to the overall MDP
in contrast to a hierarchically optimal solution in which each sub-task is optimised
given its context.
The final MAXQ feature to be highlighted is the opportunity for state abstrac-
tion. State abstraction is key to reducing the storage requirements and improving
the learning efficiency. State abstraction means that multiple states are aggregated
and one completion value stored for the aggregate state in a sub-task. Dietterich
(2000) identifies five conditions for safe state abstraction. State abstractions are
“safe” when the decomposed value functions for all base level states are identical
before and after abstraction. Without state abstraction the learning efficiency of a
hierarchical decomposition can be worse than that of a flat2 learner (see the MAXQ
performance (Dietterich, 2000) on Kaelbling’s 10× 10 maze).
As will be seen in subsequent chapters, safe state abstraction is an integral feature
of HEXQ as it uncovers hierarchical structure in MDPs. The limitation on safe state
abstraction in HEXQ for discounted value functions will be removed in Chapter 7
with the introduction of a supporting on-policy discount function.
It should be noted that MAXQ does not include the reward on completing the
(abstract) action as a part of its completion value. Instead MAXQ includes the
reward on exit as a part of the internal value of a sub-task. This thesis will argue
that rewards on sub-task exit are more naturally explained outside a sub-task and
should not be included in the internal value of a sub-task.
Determining the value of a state or finding the next best (greedy) action to
execute requires a depth first search through the completion values of the sub-tasks
in the MAXQ graph. Large branching factors and in particular deep hierarchies can
be computationally expensive. This issue will be addressed for HEXQ in Chapter
8. The solutions have relevance for MAXQ.
2 To distinguish a normal MDP from a hierarchically decomposed structure, the former is referred to as flat.
3.4.4 Optimality of Hierarchical Structures
The approaches to hierarchical reinforcement learning cited above cannot provide
any guarantees on how close the hierarchical solution is to the optimal solution of
the original MDP when the designer imposes constraints to simplify the problem.
In each of Options, HAMs and MAXQ, the sub-task policies are usually constrained
artificially by the programmer. Options and HAMs have been shown to be hierar-
chically optimal. This means the solution is optimal given the constraints of the
hierarchy. However, hierarchical optimality can be arbitrarily bad. If a designer
chooses a poor underlying machine for the HAM, the overall solution may be hier-
archically optimal but globally poor. Recursive optimality can be arbitrarily worse
than a hierarchically optimal solution as in this case further sub-optimality is intro-
duced by ignoring the context of the sub-task.
In each of the above approaches the hierarchically decomposed problem is ex-
ecuted by invoking abstract actions, possibly recursively, and running each action
until termination of the sub-task. Dietterich calls this hierarchical execution. Ter-
mination is defined stochastically for options, by choice states in HAMs and by
termination predicates in MAXQ. The stochastic nature of MDPs means that the
condition under which an abstract action is appropriate may have changed after the
action’s invocation and that another abstract action may be a better choice. How-
ever, the sub-task policy is locked-in until termination. This sub-optimal behaviour
can be arrested by interrupting the sub-task, as for example in the HDG algorithm
(Kaelbling, 1993). Dietterich calls the procedure hierarchical greedy execution when
each sub-task is interrupted after each primitive action step. The next best action
is recomputed from the top of the task hierarchy. While this is guaranteed to be
no worse than the hierarchically optimal or recursively optimal solution and may be
considerably better, it still does not provide any global optimality guarantees. The
policy constraints imposed by the sub-task and the hierarchical structure may be
such that a globally optimal policy cannot be executed.
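The difference between hierarchical execution and hierarchical greedy execution can be made explicit with a hedged Python sketch (greedy_action, is_primitive and is_terminal are assumed helpers, not a particular implementation): the hierarchy is re-entered from the root after every primitive step, so a sub-task is abandoned as soon as another abstract action looks better.

    def greedy_primitive(root_task, s, greedy_action, is_primitive):
        """Descend from the root, taking the greedy (abstract) action at each level,
        until a primitive action is reached; only that single step is executed."""
        task = root_task
        while not is_primitive(task):
            task = greedy_action(task, s)   # depth-first descent through the hierarchy
        return task

    def run_hierarchically_greedy(env, root_task, greedy_action, is_primitive, is_terminal):
        s = env.reset()
        while not is_terminal(s):
            a = greedy_primitive(root_task, s, greedy_action, is_primitive)
            s, _reward = env.step(a)        # re-plan from the root after each step
        return s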
Interestingly, Dean and Lin (1995) and Hauskrecht et al. (1998) do provide de-
composition and solution techniques that make optimality guarantees, but unfortu-
nately, unless the MDP can be decomposed into very weakly coupled smaller MDPs
the computational complexity is not necessarily reduced. Dean and Lin recom-
pute the region sub-MDPs and the higher level semi-MDP iteratively. Hauskrecht,
Meuleau, Kaelbling, Dean, and Boutilier suggest a set of policies over each sub-
space that cover all combinations of possible external values to within a defined
error (ε-grid approach). They point out that the number of policies required can
be extremely large, even when the external state values are bounded. In turn Parr
(1998) proposes techniques to reduce the policy cache and still retain optimality
guarantees. These techniques require some overhead and can still produce large
policy caches. The other major drawback is that they do not facilitate state ab-
straction. It is not possible in general to abstract whole sub-space regions as the
transitions at the abstract level would not be Markov, but depend on the history,
in particular, on the way the region was entered.
The selection of the policy cache for sub-spaces is a key issue that can trade
off computational complexity against optimality. Re-solving the sub-space value
function iteratively and the ε-grid approach are solutions to MDP decomposition
with optimality guarantees. Constraining termination to handcrafted termination
predicates, as in MAXQ, can minimise the size of the cache, but does not provide any
guarantees. Hauskrecht et al. (1998) suggested and dismissed a heuristic to generate
one policy for each peripheral state of a region, reducing the cache size to d, the
number of peripheral states. The problem is that, once a policy is invoked it will
stubbornly try to reach a peripheral state even though it may have drifted closer
to another, more preferable peripheral state. This issue can largely be overcome
in practice using hierarchical greedy execution (Dietterich, 2000), as the next best
primitive step to take is re-evaluated at each step and takes into consideration
stochastic drift.
As mentioned previously, Sutton et al. (1999a) allow primitive actions as op-
tions to guarantee optimality, but again there is no evidence that computational
complexity can be reduced in general.
A variation on the one policy per peripheral state approach is used by HEXQ to
generate a policy cache for each sub-space. While not generally making optimality
guarantees, this approach provides the right conditions for the safe abstraction of
sub-spaces. HEXQ is hierarchically optimal in general and globally optimal for
deterministic transitions when using undiscounted value functions. For stochastic
transitions, HEXQ is globally optimal if the stochasticity is limited to the top level
sub-MDP.
3.5 Learning Hierarchies: The Open Question
In the above approaches the programmer is expected to decompose the overall prob-
lem into a hierarchy of sub-tasks. The programmer must craft appropriate sub-goals
and sub-task termination conditions, decide on safe state abstraction, allocate re-
ward hierarchically or program stochastic finite state controllers with the right choice
points. Any progress towards the aim of training rather than programming agents to
achieve goals in complex environments will require that agents themselves must find
ways of learning and revising their own task hierarchies based on their experience.
The motivation for discovering hierarchy in reinforcement learning is well stated by
Dietterich (2000): “It is our hope that subsequent research will be able to automate
most of the work that we are currently requiring the programmer to do”.
To tackle the task of automating the construction of hierarchies it is necessary
to find ways of identifying the sub-system components, finding the right level of
abstraction and interfacing the sub-tasks at each level. The next
subsections will review some approaches to this challenge.
3.5.1 Bottleneck and Landmark States
One way to automate the decomposition of a multi-task problem is to find states
that are visited frequently to solve each of the tasks. These states can then become
landmarks or sub-goals that an agent can use to solve larger problems. For example
in the maze of figure 3.2, if the agent is started in a random location in the top left
room it necessarily needs to exit via one of the two doorways on the way to the goal.
The states adjacent to each doorway will be visited more frequently in successful
trials than other states. Both Digney (1998) with the NQL (nested Q learning)
algorithm and McGovern (2002) use this idea to identify sub-goals. Menache et al.
(2002) discover bottleneck states by finding state space cuts where the flow properties
of the state transition graph discovered by the agent are minimal. The agent first
learns a sub-policy to reach the sub-goal. The sub-policy is reused to accelerate the
learning of other goals that have the bottleneck sub-goals as intermediate points.
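The frequency heuristic is simple enough to sketch in Python (an illustration of the idea only, not the exact statistic used by Digney or McGovern): states that recur across many successful trajectories are proposed as bottleneck sub-goals.

    from collections import Counter

    def bottleneck_candidates(successful_trajectories, top_k=2):
        """successful_trajectories: list of state sequences that reached the goal.
        Returns the top_k most frequently visited states as candidate sub-goals."""
        counts = Counter()
        for trajectory in successful_trajectories:
            # Count each state at most once per trajectory so that long, wandering
            # episodes do not dominate the statistic.
            counts.update(set(trajectory))
        # A fuller treatment would exclude start and goal states and compare the
        # counts against those from unsuccessful trajectories.
        return [s for s, _ in counts.most_common(top_k)]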
Digney (1998) suggests using high reinforcement gradients as distinctive areas to
indicate useful sub-goals. Singh (1992) uses landmark states provided for some tasks
as sub-goals to solve composite tasks. He proposes the identification of interesting
landmark states that would make useful sub-goals. Interestingly, Kaelbling (1993)
and later Moore et al. (1999) suggest that, for navigation tasks, performance is
insensitive to the position of landmarks and an (automatic) randomly-generated set
of landmarks does not show widely varying results from more purposefully positioned
ones.
3.5.2 Common Sub-spaces
Rather than looking for intermediate points, another approach is to look for common
behaviour trajectories or common region policies. Thrun and Schwartz (1995) use
the SKILLS algorithm to find policies by growing or contracting seeded sub-space
regions. The idea is to use the partial policies over each sub-space generated from a
variety of tasks to select ones that maximise performance and minimise description
length over a number of tasks. The selected skills constrain the policies available
when learning new tasks. McGovern (2002), in a second method to discover ab-
stract actions, searches for common sub-sequences of actions in successful trials of a
learning agent. She also suggests looking for observation sequences and notes that
these may be generalised to unseen situations if sensory readings remain consistent
across tasks. This last point is pertinent to HEXQ which generalises on this basis.
Drummond (2002) detects and reuses metric sub-spaces in reinforcement learning
problems. He delimits each sub-space using the value function gradient between
neighbouring states. A high gradient means that there is an impediment such as
a wall. The doorways into and out of sub-spaces are also detected by the value
function gradient in that an exit is a local maximum and an entry point is a local
minimum. Sub-space value functions are stored and indexed by the nodes of their
fitted polygon. They are compared to new situations using sub-graph matching
and adapted using transformations on their value function. The end result is that
an agent can learn in a similar situation much faster after piecing together value
function fragments from previous experience.
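A minimal Python sketch of the boundary test (illustrative only; Drummond’s algorithm additionally fits polygons to the detected regions and matches sub-graphs) flags adjacent states whose learnt values differ by more than a threshold as candidate sub-space boundaries.

    def high_gradient_edges(V, neighbours, threshold):
        """V[s] -> learnt state value, neighbours[s] -> states adjacent to s.
        Returns pairs of adjacent states whose value gradient exceeds threshold,
        i.e. candidate sub-space boundaries."""
        return {(s, s2)
                for s in V
                for s2 in neighbours.get(s, [])
                if s2 in V and abs(V[s] - V[s2]) > threshold}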
3.5.3 Multi-dimensional States
When the problem state is perceived as a vector of features, sub-systems based
on a subset of these features are good candidates for decomposition. This idea is
exploited by Knoblock (1994) in ALPINE (a hierarchical extension to PRODIGY)
which implements an automated approach to generating abstractions in planning.
In planning, the problem is to find a path from an initial state to a goal state.
The state transition function is defined by operators with preconditions and effects.
A state must meet the preconditions before the operator can be applied and the
effects, usually in the form of add and delete lists, describe the changes to the state
when the operator is applied. A solution to the problem is a sequence of operators
(a policy) that transitions from the initial state to the goal state. Planning is
thus closely related to deterministic shortest path problems, a sub-class of MDPs.
By dropping terms, ALPINE looks for an abstract representation of the original
problem. It attempts an automatic hierarchical decomposition by searching for
partitions such that the solution at the abstract level is not affected by the solution at
the detailed level. For example, in the Towers of Hanoi problem, once the largest disc
is placed it need not be disturbed by the placement of smaller discs. While ALPINE
is a hierarchical planner there are similarities with the hierarchical reinforcement
learner HEXQ. The ordered monotonicity property, that ensures literals established
at abstract levels are not changed by refined plans in ALPINE, is related to the
nested Markov ordering requirement in CQ (Hengst, 2000), a precursor algorithm to
HEXQ. The automatic decomposition of the Towers of Hanoi problem with HEXQ
will be covered in Chapter 6.
3.5.4 Markov Spaces
Dayan and Hinton (1992) provide a multi-level partition of the state space with an
increase in resolution between the levels. They show how learning can be improved
by reusing local controllers at the different levels.
Moore (1994) uses the Parti-game algorithm to automate the partitioning of the
state space based on whether the agent becomes “stuck” and fails to move to a
neighbouring cell. By splitting local cells in the stuck region, the state space is
redefined to provide a better Markov representation that can learn to reach the goal
without failure. While this approach is of interest in automatically finding Markov
regions by decomposing the state space, the variable resolution Parti-game algorithm
is not a hierarchical controller in the sense characterised in this thesis. There are no
gating mechanisms at different levels. The algorithm does evolve through different
levels of resolution of controllers, but coarser versions are discarded in favour of finer
variable resolution models.
The UTree algorithm (McCallum, 1995) can be seen in a similar light. UTree
increases the resolution of the model by iteratively splitting both on the state space,
or the state and action history, to uncover hidden state. The splitting criterion
for UTree is that the states are Markov with respect to the value function (Utile
distinctions).
Uther (2002) uses a similar tree based method (TTree) to increase the resolution
of abstract states when abstract action “trajectories” initiated in these states give
varying results. TTree is hierarchical in the sense that the trajectories are either
default or user provided fixed policies that are switched in at the abstract level and
continue to execute until termination.
The common theme in these algorithms is that they each increase the resolu-
tion of sub-spaces in an effort to attain predictability (Markov property) in the
model. This will be seen to be an important criterion for HEXQ as sub-space region
boundaries are defined where the Markov property breaks down.
The idea of refining a learnt model based on unexpected behaviour is also devel-
oped by Ryan and Reid (2000). Here a hierarchical model, RL-TOPs, is specified
using a hybrid of planning and MDP models. Planning is used at the abstract level
and invokes reactive planning operators, extended in time, based on teleo-reactive
programs (Nilsson, 1994). These operators use reinforcement learning to achieve
their post-conditions as sub-goals. If the operators are specified too abstractly they
may have negative side-effects and this may ultimately cause failure of the planner to
reach its goal. This is amply demonstrated by Ryan (2002) using the taxi with fuel
task (Dietterich, 2000). If the agent is not told about fuel usage during navigation
it can run out of fuel and fail to deliver passengers. The RL-TOPs planner tries to
uncover hidden state using ILP based on records of positive and negative examples. Again,
as in Parti-game, UTree and TTree, the aim is to automatically refine the abstract
state to a description that is Markov with respect to primitive and abstract actions.
HEXQ results for the taxi with fuel problem will be discussed in Chapter 6.
HEXQ discovers the side effect of running out of fuel, constructs the appropriate
hierarchy of sub-MDPs and solves this problem optimally with hierarchical greedy
execution.
3.5.5 Other Approaches to Discovering Hierarchy
There is other related research that can be interpreted as discovering hierarchy. For
example, Harries et al. (1998) extract hidden context from training examples. “Con-
text” is another way of looking at the durative setting of the switching mechanism at
a higher level of abstraction. The learner tries to automatically uncover the settings
in time, by interactively sweeping the training set to minimise error rate for each
context period.
De Jong (2002) uses co-evolution and genetic algorithms to discover common
building blocks (intermediate level abstractions) that can be employed to represent
larger assemblies (higher level abstractions). The modularity, repetitive modularity
and hierarchical modularity bias of their learner is closely related to the state space
repetition and abstraction used by HEXQ.
Another interesting approach that automates decomposition, specifically for prob-
lems where each “location” is also a goal, is the multi-value-function airport hier-
archy algorithm by Moore, Baird, and Kaelbling (1999). This algorithm builds on
Kaelbling’s original HDG algorithm (Kaelbling, 1993). The airport algorithm relies
on a heuristic to generate successive (and denser) levels of landmark states (air-
ports) in such a way that each region around an airport overlaps higher level airport
locations. In this way the total state space is not partitioned in the mathematical
sense but covered by a decreasing but overlapping set of MDP regions. The regions
are exponentially reduced in size as they grow in number. Ultimately every location
becomes an airport. Policies to reach a region airport are cached at a primitive level
for each region. It is then possible, given any starting location to dynamically com-
pute the set of landmark states in order of decreasing level to reach any goal state.
The overlapping region construction ensures small fractions of regret (deterioration
from optimal performance) and the algorithm works directly over a wide range of
stochasticity.
This algorithm is specifically designed for point-to-point movement in problems
where there are few transitions between states.
3.6 Conclusions and Motivation for HEXQ
There is strong support for decomposition and a hierarchical approach to learning in
complex domains.
Automatically discovering decompositions and building task hierarchies are still
in the early stages of research. Each of the above approaches automates some of the
common elements of hierarchical learners discussed, namely, the multi-level gating
mechanisms, action abstraction, state abstraction and value-function decomposition.
Finding good sub-tasks is clearly important. Discovered sub-tasks should be reusable
multiple times and sub-task policies should be implementable as abstract actions at
higher levels of abstraction.
Despite reviewing various ways of reducing MDP representations, for example
by using the factored representation of Bayes nets, Boutilier et al. (1999) still see
a research gap. They identify the problem as follows: “Unfortunately, we are not
aware of any particular useful heuristics for finding serial decompositions for Markov
decision processes. Developing such heuristics is clearly an area for investigation.”,
and “... the problems of discovering good decompositions, constructing good sets
of macros, and exploiting intensional representations are areas in which clearer,
compelling solutions are required”.
This thesis describes a hierarchical reinforcement learning algorithm, HEXQ,
that automates each of the common elements discussed above. HEXQ takes a flat
reinforcement learning problem, without a prior model, and attempts an automatic
hierarchical decomposition and solution. HEXQ may be imagined as a learning agent
that autonomously constructs an interconnected hierarchy of models based purely on
its sense-act interaction with its environment. The sensor state is assumed to make
the environment accessible, in that, state transitions and rewards are Markov with
respect to the current sensor state and next action. The sensor state is described
by a vector of variables representing environmental features. It is the hierarchical
relationship between these features that HEXQ tries to discover and exploit.
HEXQ performs a simple variable-wise decomposition. HEXQ automatically
finds regions of sub-space that it can reuse multiple times. It learns a restricted
set of policies over these regions to provide flexible abstract actions for higher level
models. Higher level models are based on state abstractions that together with the
abstract actions form well defined Markov models. HEXQ uses a decomposed value
function that exactly and compactly represents the “flat” value function for any
hierarchical policy.
HEXQ was inspired by (1) the observation that the scientific method endeavours
to discover Markov sub-problems and could possibly be automated, and (2) the idea
that automatic discovery of structure in sequences (Nevill-Manning, 1996) may be
generalisable to state-action spaces. The various approaches to room grid-world de-
compositions (Singh, 1992, Dayan and Hinton, 1992, Dean and Lin, 1995, Digney,
1996, Hauskrecht et al., 1998, Parr, 1998, Boutilier et al., 1999, Precup, 2000, Sut-
ton et al., 1999b) provided the challenge for early decomposition attempts. Value
function decomposition was directly inspired by MAXQ (Dietterich, 2000).
The next chapters will start with the decomposition of stochastic shortest path
MDPs.
Chapter 4
HEXQ Decomposition
This chapter describes the decomposition of multi-dimensional stochastic shortest
path MDPs. Specific decomposition conditions are stated that may partition the
state space of a multi-dimensional MDP into a hierarchy of smaller sub-spaces.
The aim in this Chapter is to describe the decomposition theoretically. Chapter
5 will describe how the decomposition and solution processes are automated in
practice by the HEXQ algorithm.
HEXQ is the name used throughout this thesis to refer to the decomposition
conditions, the resulting hierarchy of sub-MDPs and the algorithm that automates
the decomposition procedure.
The theory is developed in three sections:
• A multi-dimensional MDP is decomposed into a tree of smaller semi-MDPs
(sub-MDPs). Each policy of a child sub-MDP in the tree is invoked as an
abstract action from its parent sub-MDP.
• The HEXQ decomposition is shown to be hierarchically optimal. Under certain
conditions a globally optimal solution to the original MDP is assured.
• The HEXQ decomposition allows state abstraction. The value function, given
the hierarchy, can be compactly and losslessly represented, reducing storage
Figure 4.1: A simple example showing three rooms interconnected via doorways. Each room has 25 positions. The aim of the agent is to reach the goal.
requirements and implicitly transferring learning between sub-tasks.
The simple maze discussed in Chapter 1 and reproduced in figure 4.1 is used to
illustrate the basic concepts. This maze interconnects three rooms via doorways.
Two of the doorways lead to a goal. The agent can only sense the cell that it occupies
and move one step vertically or horizontally using actions North, South, East and
West. The thick solid lines indicate barriers through which the agent cannot move.
To keep the illustration simple, all transitions are deterministic with reward −1.
The agent is started at a random location.
4.1 Assumptions
It is assumed that an episodic MDP is provided with all rewards negative and the
discount rate set to one. These MDPs are called stochastic shortest path problems
because the objective is to minimize the distance to termination. The distances
between states are considered to be negative rewards or costs.
A proper policy is a policy that leads to the termination of an MDP, with prob-
ability one, from every initial state. A policy that is not proper is improper. The
optimum policy of a stochastic shortest path problem1 will come from the hypothesis
space of proper policies as all improper policies will have unbounded value functions
for some states. This follows from equation 2.4 in section 2.2 when all rewards are
negative and the discount rate is one.
All further references to MDPs will assume stochastic shortest path conditions
and policies will be assumed proper, until Chapter 7, where the HEXQ decomposi-
tion is extended to general finite MDPs.
Variable values in this thesis are not required to have any metric or ordering
property. They can have arbitrary attribute designations and need not be numeric.
Often variables will be described by the notation x and y, etc. This is not meant
to imply a co-ordinate system. For example, a valid description of one state in the
maze problem in figure 4.1 is (red-room, center-position)2.
4.2 HEXQ Hierarchies
This section describes the decomposition of the state space of an MDP into a par-
tition of regions in such a way that each region can be used to construct smaller
MDPs. Smaller MDPs defined over these regions are solved to generate a cache of
policies. The idea is to solve the overall problem more efficiently by using the cache
of regional policies as abstract actions in a semi-MDP in which the states are an
abstraction of the regions.
1 Bertsekas and Tsitsiklis (1991) have generalized stochastic shortest path problems to include non-negative rewards under the condition that the state value function for improper policies is −∞.
2 As a matter of convenience, all algorithmic implementations in this thesis have mapped the variable values to a subset of non-negative integers. Further implementations may use more flexible and efficient data structures.
4.2.1 Partitioning MDPs of Dimension Two
The starting point is to decompose a 2-dimensional MDP. Later, this will be gen-
eralised to any number of dimensions. For a 2-dimensional MDP, the overall state
is represented by two variables, say s = (x, y). The first step is to partition3 the
state space into regions in the context of one of the variables, say the y variable.
The aim is to construct regions with respect to the x variable that (1) have similar
Markov models and (2) allow control over inter-region transitions. These properties
are important for state abstraction, as will be seen later.
The concept of an exit is critical to a HEXQ4 decomposition and will be developed
and illustrated next. A transition from state s to state s′ on action a was introduced
in Chapter 2 and is written s →a s′.
For any partition of the state space, an inter-block transition is a transition
between any two distinct blocks of the partition, such that the probability of the
transition is greater than zero. In other words, s →a s′ is an inter-block transition
if the state space is partitioned into blocks, i.e. $G = \{g_1, \ldots, g_m\}$, $s \in g_i$, $s' \in g_j$,
$i \neq j$ and $T^a_{ss'} > 0$.
A terminating transition is a transition for which the original MDP terminates.
A variant transition (with respect to y) in the (x, y) state space is one with
non-Markov outcomes from the perspective of the x variable value alone. This
arises when either the y variable changes value or the state transition probability
or reward function differs between the same x variable values for different y
variable values. More formally, $s \rightarrow^a s'$ is a variant transition if, for all $s = (x, y)$,
$s' = (x', y')$:

1. $y \neq y'$, or

2. $T^a_{ss'} \neq T^a_{tt'}$ or $R^a_{ss'} \neq R^a_{tt'}$ or $y'' \neq y'''$, $\forall\; t \rightarrow^a t'$, $t = (x, y'')$, $t' = (x', y''')$
3 See Appendix A.
4 HEXQ was originally derived from the amalgam of Hierarchical, EXit and Q function.
Definition 4.1 An exit is the state-action pair (s, a) from any inter-block, termi-
nating or variant transition, s →a s′.
For any exit (s, a), s is referred to as the exit state and a as the exit action. An
agent is said to execute exit (s, a) if it takes exit action a from state s. The entry
states of a block are all those states in the block reached in one step following the
execution of an exit from any block, or, the states in the block that belong to the
set of starting states of the MDP. The set of exit states of a block g is designated
Exits(g) and the set of entry states Entries(g).
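To make the definitions concrete, the following hedged Python sketch shows how terminating and variant transitions could, in principle, be read off a table of experienced transitions for a two-variable state (x, y). The data layout is an assumption for illustration, transition probabilities are not estimated, inter-block exits only arise once a candidate partition is in hand and are omitted here, and the algorithm that performs this detection in practice is the subject of Chapter 5.

    from collections import defaultdict

    def find_exits(transitions):
        """transitions: iterable of (x, y, a, x2, y2, reward, terminated) tuples.
        Returns a set of exits, i.e. state-action pairs ((x, y), a)."""
        exits = set()
        # Outcomes projected onto the x variable, grouped by (x, a) and by context y.
        outcomes = defaultdict(lambda: defaultdict(set))
        for (x, y, a, x2, y2, reward, terminated) in transitions:
            if terminated or y2 != y:
                # Terminating transitions and transitions that change y are exits.
                exits.add(((x, y), a))
            outcomes[(x, a)][y].add((x2, reward))
        # Condition 2 of a variant transition: if the projected behaviour of (x, a)
        # differs between y contexts, ((x, y), a) becomes an exit in every context.
        for (x, a), by_context in outcomes.items():
            models = list(by_context.values())
            if any(model != models[0] for model in models[1:]):
                for y in by_context:
                    exits.add(((x, y), a))
        return exits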
Definition 4.2 A HEXQ Partition is a partition of the states of an MDP into
blocks such that, within each block, each exit state can be reached eventually (with
probability one) from any entry state using only non-exit actions. The blocks of a
HEXQ partition will be referred to as regions.
Figures 4.2 to 4.4 give examples of exits and HEXQ partitions. Chapter 5 will
describe how the HEXQ algorithm, using strongly connected components of the
underlying state transition graph, can form these regions. The objective at this
stage is to verify that the regions and exits are consistent with their definitions. In
the figures the circles represent base level states. The squares with rounded corners
represent the regions of a partition. Each instance of a state is given by subscripted
pairs of x and y variable values, eg (x3, y1). Transitions between states are indicated
by arrows, labelled with associated actions a, b or c, joining the base level states.
Figure 4.2 shows a stochastic transition from state (x3, y0) on action b. The next
state may be (x2, y1) where the y variable has changed value. The pair ((x3, y0), b) is
therefore an exit because it is associated with a variant transition. By condition 2 of
variant transitions, ((x3, y1), b) is therefore also an exit. States (x1, y0) and (x2, y1)
are both entry states. It is not possible to reach exit state (x3, y1) from entry
state (x0, y0) with non-exit actions, necessitating the two regions. Because exits are
generated when the y variables change value, the coarsest possible partition is one
Figure 4.2: For transition (x3, y0) →b (x2, y1) the y variable changes value. As this is a variant transition, ((x3, y0), b) is an exit. If all states were in the one region, then entry state (x0, y0) cannot reach exit state (x3, y1) using non-exit actions. Two regions are therefore necessary to meet the HEXQ partition requirements.
in which each region is associated with one of the y variable values, creating |Y| regions.
Figure 4.3 illustrates an example where multiple regions are required even though
all base level states share the same y variable value. Multiple regions are necessary
in this case to ensure each entry state can reach each exit state. The exit ((x3, y0), b)
is created as the underlying transition is inter-block.
Figure 4.4 shows an exit created as a result of condition 2 of variant transitions.
Here the reward values for transitions (x0, y0) →b (x1, y0) and (x0, y1) →b (x1, y1)
differ. Exits are similarly created if the transition probabilities vary between the
same x variable values for different y variable values.
In formulating an MDP description of the rooms problem in figure 4.1, the state
space can be described in a variety of ways. Figure 4.5 shows two alternatives. In
(a) the states are described by the variables, x =position-in-room and y =room-
number. This could correspond to two sensors of a robot, one that observes its
location relative to walls and the other the room colour. In (c), the states are
Figure 4.3: In this example all states have the same y value. If all states are in the one region, then entry state (x5, y0) cannot reach exit state (x3, y0). Therefore, two regions are necessary to meet the HEXQ partition requirements. The inter-block transition means that ((x3, y0), b) is an exit.
Figure 4.4: The transitions (x0, y0) →b (x1, y0) and (x0, y1) →b (x1, y1) have different associated rewards and hence give rise to exits ((x0, y0), b) and ((x0, y1), b) by condition 2 of variant transitions.
described by their x and y coordinates as may be the case if the robot has a GPS
like sensor. Figure 4.5 (b) and (d) show the respective HEXQ partitions of regions
for each of the two state space descriptions in (a) and (c). The three regions in
(b) are the rooms. If the position-in-room variable values differ for each room then
HEXQ may not find any useful decomposition to allow generalisation. Each region
in the partition may then be a single state in which case there is no benefit to the
decomposition, underlining the importance of representation.
Figure 4.5: HEXQ partitioning of the maze in figure 4.1. The state representation affects the partitioning. In (a) the agent uses a position-in-room and room sensor, resulting in the three regions shown in (b). In (c) the agent uses a coordinate-like sensor that partitions the state space into the 15 regions shown in (d).
In the second case, (c), the HEXQ partition results in 15 regions as shown in
figure 4.5 (d). In this case every state is an exit state, because a North or South
action for any x variable value may change the y variable value. The regions are
divided into two ranges for the x values, one from 0 to 4 and one from 5 to 9. The
subdivision of regions into these two ranges is created because of variant transitions.
For example, a transition from state (4, 8) to (5, 8) is not possible, but the transition
from (4,7) to (5,7) is possible. If similar x values on action East do not have the
same state transition function for all y values, exits are created. The 10 exits are
((4, ·), East) with the y variable value ranging from 0 to 9. These exits prevent some
entry states reaching exit states requiring the two regions for the same y variable
value. Moving between states (3, y) and (4, y) with similar actions is possible with
reward −1 for all y values. These transitions do not give rise to exits.
The vector of variables used to describe a multi-dimensional MDP has no reason
to order the variables in any specific way. The foregoing decomposition conditions
assumed regions to be formed with respect to the x variable and in the context of
the y variable. For any particular problem, the assignment of the x and y variables
in the multi-dimensional state vector may be interchanged for the purposes of de-
composition. This may result in a different partition. Chapter 5 will show how the
variables are heuristically ordered in practice to find good partitions, but the HEXQ
partitioning process does not require a particular order for the variables per se.
4.2.2 Sub-MDPs
Given a HEXQ partition, it is possible to construct sub-MDPs using the states in
each region and the transition and reward functions of the original MDP.
A smaller stochastic shortest path MDP can be constructed by taking a region
of a HEXQ partition, the non-exit transition and reward functions of the original
MDP, modelling one exit of the region as a zero reward transition to an absorbing
state and making the value function the undiscounted sum of future rewards.
This is a valid sub-MDP as the transition and reward functions are Markov
by virtue of being inherited from the original MDP. Each exit state of the region is
proper because it can be reached from all entry states as a requirement of the HEXQ
partition. The construction therefore satisfies the requirements for a stochastic
shortest path MDP.
In this way it is possible to construct one MDP for each exit of every region
in the HEXQ partition of the original MDP. To distinguish these MDPs from the
original MDP they are referred to as sub-MDPs. It is important to note that a
policy can only leave the region via the specified exit because other exit actions are
disallowed by construction of the sub-MDP.
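The construction can be summarised in a short Python sketch (the table layouts are assumptions; this illustrates the construction just described rather than the implementation of Chapter 5): one sub-MDP is generated per exit of a region, the designated exit is redirected to a zero-reward absorbing state, and all other exit actions are removed.

    ABSORBING = "absorbing"   # illustrative label for the added absorbing state

    def build_sub_mdp(region_states, region_exits, chosen_exit, actions, T, R):
        """region_exits : set of (state, action) exits of this region
        chosen_exit  : the single exit this sub-MDP is permitted to use
        T[(s, a)]    : list of (s2, probability) pairs from the original MDP
        R[(s, a, s2)]: primitive reward from the original MDP"""
        sub_T, sub_R = {}, {}
        for s in region_states:
            for a in actions:
                if (s, a) == chosen_exit:
                    # The designated exit becomes a zero-reward transition to an
                    # absorbing state, giving a stochastic shortest path sub-MDP.
                    sub_T[(s, a)] = [(ABSORBING, 1.0)]
                    sub_R[(s, a, ABSORBING)] = 0.0
                elif (s, a) in region_exits or (s, a) not in T:
                    continue                      # other exit actions are disallowed
                else:
                    sub_T[(s, a)] = T[(s, a)]     # non-exit dynamics are inherited
                    for s2, _prob in T[(s, a)]:
                        sub_R[(s, a, s2)] = R[(s, a, s2)]
        return sub_T, sub_R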
4.2.3 Top level semi-MDP
A semi-MDP can be constructed to find a proper policy for the original MDP. A
semi-MDP m is defined from the states of the original MDP and abstract actions
that are proper policies for the sub-MDPs constructed in section 4.2.2.
To see that this forms a semi-MDP consider the value of state s. By definition it
is the expected sum of future primitive rewards for a policy π(s) that by construc-
tion invokes sub-MDP policies (abstract actions) and follows them until their exit,
whereupon another sub-MDP is invoked.
$$V^\pi_m(s) = E\{r_1 + r_2 + r_3 + \ldots\} \qquad (4.1)$$
Let N be the random number of single step transitions, starting in state s,
executing abstract action a and terminating in state s′ (s →a s′). The value function
can then be written as the expectation of the sum of two series.
$$V^\pi_m(s) = E\left\{\sum_{n=1}^{N-1} r_n + \sum_{n=N}^{\infty} r_n\right\} \qquad (4.2)$$
The first series is the expected sum of the rewards to termination of the abstract
action in state s′ including the reward on exit and the second is the value of the
state s′ reached following termination.
$$V^\pi_m(s) = \sum_{s',N} T^{Na}_{ss'}\left[R^{Na}_{ss'} + V^\pi_m(s')\right] \qquad (4.3)$$
This defines a semi-MDP as equation 4.3 has the form of a Bellman equation for
a semi-MDP (see equation 2.17).
The HEXQ partitioning and construction of the sub-MDPs will be called a HEXQ
decomposition. While the treatment here is similar to that of other researchers,
(Dean and Lin (1995), Parr (1998), Hauskrecht et al. (1998), Dietterich (2000)) a
key difference in the HEXQ formulation of the semi-MDP is that abstract actions
terminate on exit execution whether the region is left or not. An aggregate state
(or region) can therefore transition to itself.
Region policies and termination after exit execution are a specific form of options
(Sutton et al., 1999a) in which the probability of termination is 1 when executing
an exit and 0 otherwise.
Solving this semi-MDP provides a solution to the original MDP constrained by
the abstract actions. As for HAMs, MAXQ and options, this policy is in general
non-stationary and sub-optimal. It is non-stationary because the primitive actions
taken in some states may depend on which abstract action was executed previously
from an entry state. It may be sub-optimal because sub-MDP policies have been
restricted to those producing proper single exits. Section 4.3 explores the optimality
of HEXQ decompositions in greater detail.
To illustrate the generated hierarchical structure for the simple maze with the
state space and partition shown in figure 4.5 (a) and (b), the top semi-MDP and
12 sub-MDPs for the regions are shown in figure 4.6. Each region has four exits
and therefore generates four sub-MDPs. All states are entry states as they are
all possible initial states. The exit states in the top semi-MDP are highlighted by
shading. Starting in any state, say (3, 0), there are four abstract actions that can be
chosen corresponding to the policies of the sub-MDPs. For state (3, 0) the relevant
sub-MDPs are those leftmost in figure 4.6. There is one sub-MDP and policy to leave
Figure 4.6: The HEXQ tree for the simple maze showing the top level semi-MDP and the 12 sub-MDPs, 4 for each region. The numbers shown for the sub-MDPs are the position-in-room variable values.
the room to the North, East, South and West. The agent will learn that choosing
the abstract action that leaves the room to the East is optimal when starting in state
(3, 0) as this is the shortest way to the goal. When this abstract action terminates
in state (10, 1), another abstract action is chosen, this time the policy corresponding
to the top right sub-MDP in figure 4.6 which will lead the agent directly to the goal.
It is evident from figure 4.6 that some of the sub-MDPs are identical. HEXQ
will merge some of the leaves of this tree to form a more compact directed acyclic
graph representation to be described later in this Chapter. In the next section the
HEXQ decomposition is generalized to higher dimensional MDPs.
4.2.4 Higher Dimensional MDPs
The HEXQ decomposition of MDPs with dimension greater than 2, say d, is based
on the recursive application of the above decomposition with two state variables.
The variables in the state vector are grouped into two sets $\{s_1, s_2, \ldots, s_{d-1}\}$ and
$\{s_d\}$. A new state vector is defined with two state variables, the first being the
Cartesian product of the $d-1$ variables. Hence $s = (s_1 \times s_2 \times \ldots \times s_{d-1}, s_d)$.
The HEXQ decomposition is applied as in section 4.2.3. Each sub-MDP is now a
stochastic shortest path MDP in its own right, following the constructions in section
4.2.2, with a state vector of effectively $d-1$ variables as the variable $s_d$ is constant for
each sub-MDP. Each sub-MDP can be further HEXQ decomposed by formulating
an equivalent MDP with the two state variables $s = (s_1 \times s_2 \times \ldots \times s_{d-2}, s_{d-1})$. In
this way the original MDP is recursively decomposed, one variable at a time. The
recursive decomposition halts when the sub-MDPs have been reduced to one state
variable. A tree (HEXQ tree) of sub-MDP nodes is generated in this way with a
depth of $d-1$. The levels in the tree are labelled with consecutive integers from the
leaf node sub-MDPs (level 1) to the root node sub-MDP (level $d$).
All the sub-MDPs in the tree are semi-MDPs except for the leaf node sub-MDPs5.
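As a purely illustrative aside, the recursive grouping can be sketched as follows (a minimal Python sketch; the tuple representation of a state and the helper names split_state and levels are hypothetical, not part of HEXQ itself):

    # A minimal sketch of the recursive variable grouping: at each level the
    # state vector (x1, ..., xd) is treated as the pair (x1 x ... x x_{d-1}, xd).
    # Names and the tuple representation are illustrative only.

    def split_state(state):
        # Split (x1, ..., xd) into the compound lower part and the context variable xd.
        *lower, context = state
        return tuple(lower), context

    def levels(state):
        # List the (compound, context) pairs produced by recursive splitting,
        # from the top level down to a single remaining variable.
        result = []
        while len(state) > 1:
            lower, context = split_state(state)
            result.append((lower, context))
            state = lower
        return result

    if __name__ == "__main__":
        # A 3-variable state such as (location-in-room, room, floor).
        print(levels((7, 2, 1)))   # [((7, 2), 1), ((7,), 2)]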
4.2.5 Value Function Decomposition with HEXQ Trees
The aim of this subsection is to derive the equations for the recursive representation
of the value functions of a state s in sub-MDP m in the HEXQ tree executing a
hierarchical policy. The value function decomposition was motivated by Dietterich
(2000) who in turn built on previous work by Singh (1992), Kaelbling (1993), Dayan
and Hinton (1992), Dean and Lin (1995).
5 All the HEXQ results for stochastic shortest path MDPs as defined here apply just as well if the leaf node sub-MDPs were semi-MDPs. As the discount factor is 1, the value function and optimum policy do not depend on the timing of base-level state transitions.
Definition 4.3 A hierarchical policy is a set of proper policies, one for each
sub-MDP in the HEXQ tree. So for sub-MDPs 1, 2, . . . , n hierarchical policy π =
{π1, π2, . . . , πn}.
Figure 4.7: An example trajectory under policy π, for N = 4 steps, where sub-MDP m invokes sub-MDP ma using abstract action a, showing the sum of primitive rewards to the exit state sa and the primitive reward on exit.
For sub-MDP m, assume that the current abstract action, π(s) = a, has invoked
sub-MDP ma and will terminate in state s′, after a random number of N time-steps,
for a given hierarchical policy π. The reward on exit of sub-MDPs is not included
in the value function by construction. Therefore, to use the values of the states
from sub-MDP ma in equation 4.2 the reward to exit is split into the cumulative
reward $R^{N-1,a}_{s s_a}$ to reach the exit state $s_a$ and the expected primitive reward on exit $R^a_{s_a s'}$. The notation $V^\pi_m(s')$ refers to the value of state $s'$ in sub-MDP $m$ where all
sub-MDPs invoked by m follow the hierarchical policy π. Figure 4.7 illustrates the
components of the total reward for transitioning from state s through to state s′ as
a result of taking abstract action a that invokes sub-MDP ma.
$$V^\pi_m(s) = \sum_{s',N} T^{N,a}_{s s'} \left[ R^{N-1,a}_{s s_a} + R^a_{s_a s'} + V^\pi_m(s') \right] \qquad (4.4)$$
Theorem 4.4 For any proper sub-MDP policy in the HEXQ tree the next state on
exit and the expected reward on exit are independent of the entry state.
Proof For a region in a HEXQ partition, any exit state can be reached from
every entry state using non-exit actions, by definition. By construction, any partic-
ular sub-MDP defined over a region has all exits disallowed except for one that is
modelled as an absorbing state (subsection 4.2.2). A proper sub-MDP policy must
therefore use this exit independently of the entry state. The independence of the
exit makes the reward on exit and the next state independent of the entry state
because of the Markov property of the transition and reward functions.
The first term on the RHS of equation 4.4 is the value of state s in sub-MDP
ma as it represents the expected sum of rewards to the exit state of sub-MDP ma
for abstract action a. Because discount factor γ = 1, γ does not appear in these
equations. As rewards on exit and the next state s′ are independent of N , the
expectation can be taken with respect to the exit reward and state s′ alone. Hence
$$V^\pi_m(s) = V^\pi_{m_a}(s) + \sum_{s'} T^a_{s_a s'} \left[ R^a_{s_a s'} + V^\pi_m(s') \right] \qquad (4.5)$$
where $T^a_{s_a s'}$ is the probability of transitioning to state $s'$ after abstract action $a$ terminates and $R^a_{s_a s'}$ is the expected final primitive reward on transition to state $s'$
when abstract action a terminates. Equation 4.5 decomposes the value function for
a state recursively.
Definition 4.5 The HEXQ action-value function E (or exit value for short)6
of state s for abstract action, a, in sub-MDP m, is the expected value of future rewards after completing the execution of the abstract action, a, starting in state s and following the hierarchical policy π thereafter. Note that E includes the expected primitive reward on exit, but does not include any rewards accumulated while executing the sub-MDP associated with a.

6 While the HEXQ function E is similar to Dietterich's completion function there are important differences. In particular, the inclusion of the primitive reward on exit in the HEXQ function will be seen to have advantages, such as greater compaction of the value function representation, the automatic generation of the decomposition hierarchy and automatic hierarchical credit assignment.
$$E^\pi_m(s, a) = \sum_{s'} T^a_{s_a s'} \left[ R^a_{s_a s'} + V^\pi_m(s') \right] \qquad (4.6)$$
Substituting equation 4.6 into equation 4.5 gives
$$V^\pi_m(s) = V^\pi_{m_a}(s) + E^\pi_m(s, a). \qquad (4.7)$$
For a 1-dimensional MDP and for all level 1 HEXQ sub-MDPs of a multi-
dimensional MDP, the HEXQ function E is identical to the normal Q function.
In this case the value of the term $V^\pi_{m_a}$ in equation 4.7 is zero (since there are no internal transitions in a primitive state) and $V^\pi_m(s) = E^\pi_m(s, a) = Q^\pi_m(s, a)$ where $a = \pi(s)$.
The HEXQ function E can be interpreted as a hierarchical generalisation of the Q
function.
Given a HEXQ tree of depth $d-1$ and a hierarchical policy $\pi$ which invokes a nested set of abstract actions $a_d, a_{d-1}, \ldots, a_1$ with corresponding sub-MDPs $d, d-1, \ldots, 1$, the value of state $s$ is decomposed as

$$V^\pi_d(s) = E^\pi_1(s, a_1) + E^\pi_2(s, a_2) + \ldots + E^\pi_d(s, a_d). \qquad (4.8)$$
Figure 4.8 shows an example of how the value of state (3, 0) is composed from the
E values in the sub-MDP for the top left room plus the E value in the top sub-MDP.
It is assumed that the hierarchical policy at the top level invokes the abstract action
that moves the agent out of the East doorway and, for the top left room sub-MDP,
the next action is down. As the reward is −1 for each step, the E value to the
doorway is −3 and the E value for the abstract action to exit and then to reach the
goal is −5.

$$V^\pi_2((3,0)) = E^\pi_1((3,0), \text{south}) + E^\pi_2((3,0), \text{leave room east}) = -3 - 5 = -8$$

Figure 4.8: The value of state (3, 0) is composed of two HEXQ E values.
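To make the additive decomposition concrete, the following is a minimal sketch (a hypothetical helper hexq_value and hand-coded toy tables, not the thesis implementation) that evaluates a state by summing the stored E values along the levels of a HEXQ tree for a fixed hierarchical policy:

    # A minimal sketch of hierarchical value evaluation (equation 4.8): the value
    # of a state is the sum of the exit values E of the abstract actions chosen by
    # the hierarchical policy at each level.  Data layout is illustrative.

    def hexq_value(state, policy, E, top_level):
        total = 0.0
        for e in range(1, top_level + 1):
            a = policy[e][state]          # abstract action chosen at level e
            total += E[e][(state, a)]     # exit value stored for that choice
        return total

    if __name__ == "__main__":
        # Toy numbers mirroring the worked example for state (3, 0):
        # E at level 1 (reach the east doorway) = -3,
        # E at level 2 (exit east and then reach the goal) = -5.
        s = (3, 0)
        policy = {1: {s: "south"}, 2: {s: "leave-room-east"}}
        E = {1: {(s, "south"): -3.0}, 2: {(s, "leave-room-east"): -5.0}}
        print(hexq_value(s, policy, E, top_level=2))   # -8.0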
This section has illustrated how a multi-dimensional MDP is HEXQ decomposed
into a tree of sub-MDPs. The value of any state is the sum of exit values of the
sub-MDPs invoked along a path from the root to a leaf node of the HEXQ tree.
4.3 Optimality of HEXQ Trees
The question arises as to whether decomposed MDPs can provide an optimal so-
lution to the original MDP. This section will show that a HEXQ decomposition is
hierarchically optimal in general and globally optimal under certain conditions.
The solution to a hierarchically decomposed MDP can be classified by different
kinds of optimality. It is of course desirable for the hierarchically decomposed MDP
to achieve the globally optimal solution to the original MDP. As far as is known,
there is no general method that can solve all MDPs both with reduced computational
complexity and optimally by decomposition. If it is not possible to achieve an
optimal solution then the next best outcome is to guarantee that a solution is ε-
optimal. This means that, for all states, the value function can be guaranteed to
be close to optimal. Parr (1998) suggests generating an ε-optimal policy cache over
the regions of a decomposition. A policy cache is ε-optimal if for any set of values
of termination states of a region, a policy can be found such that after one value
iteration, the value of any state in the region does not change by more than ε. It is
possible to trade off optimality against the number of required policies in the cache.
However, unless the regions are very loosely connected or ε is very large, the number
of policies necessary to construct the cache may be prohibitive.
Imposing a hierarchy and hierarchical execution usually constrains the policies in
some way that prevents global optimality guarantees. It is then desirable to achieve
the best value for all states consistent with the constraints of the hierarchy. A policy
that meets these criteria is called a hierarchically optimal policy. It is easy to see
that a hierarchically optimal policy is not necessarily globally optimal. For example,
when sub-MDPs in the hierarchy are invoked they will execute until termination,
even though, due to stochastic actions, it may be better to switch to a different
strategy.
For MAXQ, in which sub-MDPs are reused, Dietterich (2000) defines a weaker
form of optimality called recursive optimality. A recursively optimal policy is one
in which each sub-MDP is solved with defined values for its termination states
(Parr calls them out-states) without considering the context of the sub-MDP in the
overall problem. In MAXQ, for stochastic shortest path problems, Dietterich (2000)
equates these values to zero for all desirable (goal) terminations and an arbitrary
large negative number for undesirable terminations. Dietterich then reuses the policy
learnt for these termination values in other situations where the same goal states
are desired, but where they may have different values. While this has the advantage
of reuse, MAXQ sub-task policies may not be optimal in their different contexts.
MAXQ is therefore not necessarily hierarchically optimal, meaning that the overall
solution may not be the best one given just the constraints of the hierarchy.

Figure 4.9: A region with two exits, where the HEXQ decomposition misses a potentially low cost exit policy from the region.
The solution of a HEXQ decomposition can be demonstrated to be arbitrarily worse than a globally optimal policy, as the example in figures 4.9 and 4.10 shows.
Figure 4.9 illustrates a region of three states, s1, s2 and s3 together with the tran-
sitions and rewards between them. The vertical line indicates the region boundary.
There are two possible exits, one from state s2 and the other from state s3, that lead
out of the region on action b. The entry state is assumed to be s1. The actions are
deterministic, except for action a from state s1 that has a 50% probability of moving
to either state s2 or s3. The generated HEXQ sub-MDPs for this region will provide
one policy for leaving the region via each exit, but not both as shown in figure 4.10
(a) and (b). The optimum value of state s1 is −10 taking optimum action b or c
depending on the exit goal. While action a may reach the exit state with less cost,
there is a 50% risk that the wrong exit state is reached and a heavy penalty is in-
curred (a reward of −100) to recover, i.e. return to state s1. If the globally optimal
policy did not favor either exit, then a policy that includes taking action a in state
s1 improves the value of state s1 from -10 to -1. This policy, shown in figure 4.10
(c), is unavailable to HEXQ by construction and therefore HEXQ may not produce
a globally optimal solution. Depending on the magnitude of the rewards in figure 4.9 the HEXQ value function may be arbitrarily worse than optimal.

Figure 4.10: For the region in figure 4.9 the optimal policies for the two sub-MDPs created by HEXQ (one for each exit) are shown in (a) and (b). The optimal policy to use either exit, shown in (c), is not available to HEXQ by construction.
Parr (1998) makes the same point about HAMQ learning which he proves to be
hierarchically optimal. An MDP combined with a HAM machine will not necessarily
find the optimal solution to the MDP. While the solution will be consistent with the
constraints imposed by the underlying machine specification, there is no guarantee
that the resultant HAM will produce a solution that is close to optimal (Parr, 2002).
Hierarchical execution means that sub-MDPs are executed until termination.
Hauskrecht et al. (1998) point out that in this case a sub-MDP will stubbornly
attempt to exit a region by the exit determined from the level above, even though
the agent may have slipped closer to another exit that is now more optimal. Di-
etterich generalises Kaelbling’s (1993) ideas showing that a possible improvement
to a recursively optimal solution is to execute it non-hierarchically and re-evaluate
the optimal policy at every level after every step. Dietterich refers to this as a hi-
erarchical greedy policy and shows that it is no worse than the recursively optimal
policy. However, even a hierarchical greedy policy cannot provide any optimality
guarantees. The example in figures 4.9 and 4.10 makes the point. The globally
optimal policy may be to take action a from s1, however, the HEXQ tree does not
contain this possibility as all sub-MDPs have been optimized to terminate in only
one exit.
Theorem 4.6 The recursively optimal solution of a HEXQ decomposition is hier-
archically optimal.
Proof All sub-MDPs below the top level have only one exit state and the discount
factor is 1. Therefore an optimal policy for each sub-MDP is optimal in all con-
texts as the value of the state reached after exit cannot affect the optimal policy
for the sub-MDP. The top level sub-MDP does not have a context (there are no
other higher level variables) and is solved with the unique termination values from
the original MDP. Therefore, a recursively optimal solution is the best that can be
achieved given the hierarchical constraints and is therefore hierarchically optimal.
The hierarchically optimal value and exit value of state s in a HEXQ decompo-
sition follow directly from equations 2.18, 4.7 and 4.6 as follows:
$$V^*_m(s) = \max_a \left[ V^*_{m_a}(s) + E^*_m(s, a) \right] \qquad (4.9)$$

$$E^*_m(s, a) = \sum_{s'} T^a_{s_a s'} \left[ R^a_{s_a s'} + V^*_m(s') \right] \qquad (4.10)$$
4.3.1 Globally Optimal HEXQ
The solution to a HEXQ decomposed MDP can be shown to be globally optimal in
some cases.
Parr (1998) introduced the concept of a policy cache being ε-optimal. Knowing
that a policy cache is ε-optimal provides a guarantee that the global solution is
within a set bound of optimal. In particular, a corollary from Parr’s theorem 16
(Parr, 1998) is that if the policy caches for all regions of an MDP are 0-optimal, then
the optimal policy for the semi-MDP, using the policies from the caches as abstract
actions, is globally optimal.
Lemma 4.7 For a deterministic shortest path MDP (one with deterministic ac-
tions) the optimal policies for the sub-MDPs defined over HEXQ regions can be used
to implicitly generate 0-optimal policy caches for each region for any set of termi-
nation state values.
Proof By construction, each region has an associated set of sub-MDPs, one leading
to each region exit state. Define the termination states $s'$ as the states reached following the execution of an exit, with associated values $V^o(s')$. Assume an arbitrary set of value function values
for the termination states. The optimal value of state s is given by
$$V^*(s) = \max_\pi E\left\{ \sum_{n=1}^{\text{to exit}} r_n + V^o(s') \right\}. \qquad (4.11)$$
Since the actions are deterministic an optimal policy for the MDP will determine
the region’s exit. The search for the optimum value for state s can therefore be
written as a search over all possible exits leading to termination states. Furthermore,
the optimum value to the exit is given by one of the HEXQ decomposition sub-MDPs
values for state s by construction. Hence
$$V^*(s) = \max_{i \in \text{Exits}} \left\{ V^*_i(s) + R_i + V^o(s_i) \right\} \qquad (4.12)$$

where $V^*_i(s)$ is the optimum expected value of state $s$ in the sub-MDP leading to exit $i$. Reward $R_i$ is the expected reward on exiting via exit $i$ and $V^o(s_i)$ is the value
of the termination state following exit.
Therefore, given a set of termination state values, the optimal abstract action to
take in state s is the optimal policy for the sub-MDP with exit i that maximises the
argument in the above equation. The best action can be determined for every state
in the region resulting in a policy that is 0-optimal for a particular set of termination
state values. In this way a 0-optimal policy cache can be constructed for any set of
out state values.
The hierarchical execution of a HEXQ decomposition for a deterministic shortest
path problem implicitly executes a 0-optimal policy cache since equation 4.12 has
the same form as equation 4.9.
Theorem 4.8 Hierarchical execution of a HEXQ decomposed deterministic shortest
path MDP is globally optimal.
Proof Proceeding by induction, if level i sub-MDPs are optimal they implicitly
generate a 0-optimal policy cache for the semi-MDP at level i + 1 by the above
lemma. By Parr’s result the MDP represented by the semi-MDP at level i + 1 is
therefore optimal. A leaf node sub-MDP has only base level states and primitive
actions and is therefore optimal.
Corollary 4.9 The hierarchical execution of a HEXQ decomposed stochastic short-
est path problem is globally optimal if the stochasticity is present at the top level
only.
Proof It is only necessary to note that for a semi-MDP to produce a globally
optimal solution for the original MDP by Parr’s theorem, it is not required to be
deterministic, just that each region of the original MDP needs to have a 0-optimal
policy cache. Since all sub-MDPs below the top level are deterministic, all regions
have a 0-optimal policy cache by lemma 4.7.
This section has shown that the HEXQ decomposition is hierarchically optimal,
but that no globally optimal guarantees can be made for HEXQ in general. For
decompositions that only have stochastic sub-MDPs at the top level and determin-
istic actions for sub-MDPs at all other levels, HEXQ has been proven to provide a
globally optimal solution.
4.4 Representing HEXQ trees compactly
A major benefit of a good HEXQ decomposition is the significant savings in com-
putation time and storage at each level of the hierarchy. This section explains the
opportunities for the compact representation of a HEXQ decomposed MDP by ab-
stracting states. The approach is to show compaction for a 2-dimensional MDP and
then to recursively generalise the results for higher dimensions.
4.4.1 Markov Equivalent Regions (MERs)
Take an MDP with state s = (x, y). A HEXQ partition is likely to contain many
similar regions. The HEXQ decomposition ensures that each leaf sub-MDP in a
HEXQ tree is independent of the y variable and that intra-region non-exit transitions
between similar x states are identical. Many regions are in a sense equivalent and
the equivalence will be used to great advantage to compact the state space.
To describe this equivalence formally, a binary relation7 B on HEXQ partition
G = {g} is defined as follows:
$$g_i \, B \, g_j \text{ if and only if } x = x' \text{ for some } s = (x, y) \in g_i, \; s' = (x', y') \in g_j \qquad (4.13)$$
The binary relation B is an equivalence relation8. The equivalence classes of B on
G, [gi], will be referred to as Markov equivalent regions (MERs) to highlight the property that non-exit state and reward transitions between x values in equivalent regions are the same for all y values.

7 See Appendix A.
8 See Appendix A.

Figure 4.11: The maze HEXQ graph with sub-MDPs represented compactly.
It follows that all sub-MDPs in a HEXQ tree whose exit states have the same x variable value are identical. This means that a Markov equivalent region exit policy
can be reused for all y variable values. Because of this equivalence, a HEXQ tree
is represented compactly by redirecting edges to similar sub-MDPs at each level,
thereby converting the tree into a DAG, called a HEXQ Graph.
The savings in storage depends on the number of elements in each of the Markov
equivalent regions. The HEXQ tree for the maze problem in figure 4.6 can be
compactly represented as shown in figure 4.11 by eliminating 8 of the sub-MDPs in
figure 4.6 as they are clearly duplicates. In this example there are 3 region elements
in the one room Markov equivalent region class and the storage is reduced to one
third for the leaf node sub-MDPs.
Since all Markov equivalent region actions are hypothesized to be available for
each region instance the agent will explore them as abstract actions in the top-level
semi-MDP. For example, starting in state (3, 0) exploratory exit attempts will be
made at (state (2, 0), action North), ((10, 0), West), ((22, 0), South) and ((14, 0),
East). Of course the first two exits will be found to be sub-optimal for all starting
states in the room as the exits simply return the agent to the respective exit state
with a reward of −1.
4.4.2 State Abstracting Markov Equivalent Regions
Allowing for the possibility of starting in any base-level state, the top level semi-
MDP would still require as many E values as the original MDP would require Q
values. This is in addition to values stored to represent the abstract actions as
sub-MDP policies.
It is however possible to compactly represent the semi-MDP E table for any
fixed policy over the HEXQ Graph. From theorem 4.4 the next state on exit and
the expected reward on exit are independent of the entry state. This independence
means that the HEXQ action value function is not dependent on any particular state
s ∈ g and can be written as
$$E^\pi_m(g, a) = \sum_{s'} T^a_{g s'} \left[ R^a_{g s'} + V^\pi_m(s') \right] \qquad (4.14)$$
where $g$ is the aggregate state containing state $s$, $T^a_{g s'}$ is the probability of exiting in $s'$ after taking action $a$ from any state in $g$, and $R^a_{g s'}$ is the expected reward for this
exit transition. It is now only necessary to store the HEXQ function Eπm(g, a) once
instead of for all s ∈ g. The number of aggregate states |G| is less than the number
of total states |S| by a factor of |S|/|G| and represents the storage savings factor for
this type of state abstraction at each level.
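The following minimal sketch (the dictionary layout and the region_of helper are assumptions for illustration) shows the storage saving: the exit value table is keyed by the aggregate region rather than by every state inside it:

    # A minimal sketch of the E-table compaction behind equation 4.14: because
    # the exit transition is independent of the entry state, one entry per
    # (region, abstract action) replaces one entry per (state, abstract action).

    E_per_state = {}    # keyed by (state, abstract_action)  -- before compaction
    E_per_region = {}   # keyed by (region, abstract_action) -- after compaction

    def region_of(state):
        # Map a state to its aggregate region; here simply the higher level
        # variable (e.g. the room number) of a (position-in-room, room) state.
        return state[1]

    def store_exit_value(state, action, value):
        E_per_state[(state, action)] = value
        E_per_region[(region_of(state), action)] = value

    if __name__ == "__main__":
        for pos in range(25):
            store_exit_value((pos, 0), "leave-room-east", -5.0)
        print(len(E_per_state), len(E_per_region))   # 25 1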
For the HEXQ graph in figure 4.6, abstracting whole Markov equivalent regions to single states makes it possible to collapse the 75 states in the top level semi-MDP into just three abstract states as depicted in figure 4.12.

Figure 4.12: The simple maze HEXQ graph with top level sub-MDP abstracted.
4.4.3 Compaction of Higher Dimensional MDPs
State abstraction for 2-dimensional MDPs can be recursively applied for d-dimensional
MDPs.
The approach is similar to that used for generating HEXQ partitions in section
4.2.4. The compaction is first applied to an equivalent two state MDP where the
state is represented as s = (s1× s2× . . .× sd−1, sd). Each sub-MDP in the resultant
HEXQ graph has one less variable in its state description and can be described with
the two state variables s = (s1 × s2 × . . . × sd−2, sd−1). MERs are now found at
level d− 1 i.e. for projected states x = s1 × s2 × . . .× sd−2. Common policies over
the MERs from all sub-MDPs are stored only once. MERs are state abstracted
and represented by single higher level states in each sub-MDP at level d − 1. This
process is recursively applied down the HEXQ tree. The end result is a compact
HEXQ graph.

Figure 4.13: The multi-floor maze.
To illustrate the compaction of a higher dimensional MDP the maze example is
extended in figure 4.13 to include another floor of rooms. A first floor with four
identically sized rooms is added and linked by a lift. Exiting the north-east room to
the east on each floor transports the agent to the same room location on the other
floor. The rooms are identified in the same manner on each floor. A position in this
“house” is specified by the three state variables; floor, room and location-in-room.
Figures 4.14 and 4.15 show the HEXQ tree and compact HEXQ graph respectively.
Decomposition equations 4.9 and 4.10 can be restated in compact form as follows:
$$V^*_m(s) = \max_a \left[ V^*_{m_a}(s) + E^*_m(g, a) \right] \qquad (4.15)$$

$$E^*_m(g, a) = \sum_{s'} T^a_{g s'} \left[ R^a_{g s'} + V^*_m(s') \right] \qquad (4.16)$$
They give the compact decomposed hierarchically optimal value function for state $s$ in sub-MDP $m$ at level $e$, where (abstract) action $a$ invokes sub-MDP $m_a$ defined over region $g$. At the lowest level $V^*_{m_a} = 0$, as all primitive actions are exit actions and do not invoke lower level routines.

Figure 4.14: The HEXQ tree of sub-MDPs generated from the multi-floor maze.

Figure 4.15: The HEXQ directed acyclic graph of sub-MDPs derived from the HEXQ tree.
The implicit state abstraction from s to g can be seen as an application of
Dietterich’s result distribution irrelevance condition for permitting state abstraction.
The Markov Equivalent Region compact representation in HEXQ is related to Max
node irrelevance and leaf irrelevance in MAXQ and is similar to model minimisation
(Dean and Givan, 1997) over the MER sub-spaces.
Ravindran and Barto (2002, 2003) have defined homomorphic mappings of sym-
metric sub-MDPs. This would allow the 4 sub-MDPs in figure 4.12 to be compacted
further into one “relativised” option, in their terminology. Automating symmetry
discovery may be a useful future research direction.
In summary, this Chapter has shown how a stochastic shortest path MDP can
be HEXQ decomposed and represented compactly by abstracting base-level states
and primitive actions. The decomposition results in a multi-level hierarchy of sub-
MDPs that, in general, are all semi-MDPs, constrained to use abstract actions that
are fixed policies for exiting sub-MDP regions. At each level of the HEXQ graph
there is substantial opportunity for compaction, as all the sub-MDPs are formulated
to be independent of variables associated with higher levels. In addition, termination
of each sub-MDP is independent of how the sub-MDP is entered and hence, the value
following exit can be represented using just one table entry for a whole region of
aggregate states.
The next Chapter will describe the HEXQ algorithm, automating the decompo-
sition and solution process of an MDP in practice.
Chapter 5
Automatic Decomposition: The
HEXQ Algorithm
This Chapter describes the basic hierarchical reinforcement learning algorithm HEXQ.
It is assumed that a multi-dimensional MDP from the class of stochastic shortest
path problems has been provided. The objective is to decompose and solve the
problem automatically when the model (the state transition and reward functions)
of the original MDP is not provided to the algorithm. The hope is to solve the
original problem more efficiently, that is, with less storage requirements and in less
time.
The steps of the HEXQ algorithm are outlined in table 5.11 based on the theory
from the previous Chapter. In contrast to the treatment in the theory, the algorithm
builds the HEXQ graph from the bottom up starting with the leaf sub-MDPs. The
rest of this section gives a broad overview of the algorithm using the maze example.
The following sections will then give a more detailed account.
The choice of representation of the original problem plays a significant role in the
decomposition. For the simple maze example in the top of figure 5.1 the state space
is assumed to be described by the two variables: position-in-room and room-number.
1 The notation to describe the state vector for the original MDP has been changed from s to x to reserve the variables s to describe aggregate states in the HEXQ decomposition.
Table 5.1: The HEXQ algorithm

function HEXQ( MDP 〈states = X, actions = A〉 )
 1. X ← sequence of variables 〈X1, X2, . . . , Xd〉 sorted by frequency of change
 2. S1 ← X1
 3. A1 ← A
 4. for level e ← 1 to d − 1
 5.     explore Se, Ae transitions at random to find T^a_{ss'}, Exits(Se), Entries(Se)
 6.     MERe ← Regions(Se, Ae, T^a_{ss'}, Exits(Se), Entries(Se))
 7.     construct sub-MDPs from MERe using Exits(Se)
 8.     for all sub-MDPs m
 9.         E*_m(se, ae) ← ValueIteration(sub-MDP m, γ = 1)
10.     Ae+1 ← ∪_{i ∈ MERe} Exits(i)
11.     Se+1 ← MERe × Xe+1
12. Execute(level d, initial state, top level sub-MDP)
end HEXQ
The sub-task of intra-room navigation, using the position-in-room variable, does not
depend on the room-number variable value. In this sense rooms are regions invariant
of the specific room number. HEXQ tries to find sub-space regions, one variable at a
time, for which the state transition and reward functions are invariant in all contexts
defined by the values of the remaining variables. This invariance is exploited later,
as policies learnt over these regions are reused in different contexts. The individual
state variables are used to construct the different levels of the hierarchy. There are
as many levels in the hierarchy as there are variables in the overall state description2.
The algorithm starts by sorting the variables of the problem by frequency of
change (line 1), as frequently changing variables are likely to be associated with
tasks at lower levels. Since the room-number changes at less frequent time intervals
the algorithm explores the behaviour of the position-in-room variable first (lines 2
to 5). The state space that the algorithm considers at this stage is the total state space projected onto the position-in-room variable values.

2 Section 5.7 describes the conditions under which levels may be collapsed to optimise storage requirements.

Figure 5.1: The simple maze example introduced previously. The invariant sub-space regions are the rooms. The lower half shows four sub-MDPs, one for each possible room exit. The numbers are the position-in-room variable values.

The transition and reward
model for this projection is explored using primitive actions. The room region (the
MER in line 6) is discovered by HEXQ by finding state transitions and associated
rewards that are invariant of room number and linking them together to form a
contiguous state-action space. Some transitions are discovered not to be invariant
of room-number. For example, moving North from position-in-room 2 transitions to
position-in-room 22 when in room 2 but position-in-room 2 when in room 0. These
unpredictable transitions are flagged as exits. In this example exits correspond to
potential ways of leaving a room via doorways but they may have a more abstract
interpretation in other problems.
Having identified a typical room region, the only motivation an agent in a room
can have is to leave it. The reason is that all immediate rewards for transitions inside
the room are negative by definition of a stochastic shortest path problem and a policy
to stay inside a room indefinitely will not be optimal. The algorithm constructs
separate sub-MDPs (line 7) that learn the value function and a policy (lines 8 and 9)
to exit a room via every possible exit. These sub-MDPs are illustrated on the bottom
of figure 5.1. Note that by projecting the total state space onto the position-in-room
variable the original representation has allowed the three rooms to be represented
as one generic room region, implicitly performing the state abstraction to this one
Markov equivalent region (MER) as described in section 4.4.1.
The learnt policy to leave a room by one of the exits can be invoked anywhere
in a room and has the effect of moving the agent out of the room. This represents
a temporally extended or abstract action. From the viewpoint of an agent that can
only sense the room it is in, performing abstract room leaving actions is all that is
necessary for it to solve the maze problem. It can learn to choose the right policy
of abstract actions to navigate from room to room and reach its goal. This abstract
problem is a semi-MDP, its abstract actions are policies to exit each room (line 10)
and its abstract states are the individual rooms using the room-number variable
(line 11).
To decompose MDPs with more than two variables the algorithm will search for
abstract regions in the abstract problem, iteratively constructing the HEXQ graph
level by level using the for-loop in lines 4 to 11.
The top level will always be just one semi-MDP that solves the original problem.
It is invoked in line 12 of the algorithm. The top level semi-MDP invokes a cascade
of abstract actions along the path down the HEXQ graph. The actions at the lowest
level are the primitive actions. In the maze example the agent will first decide on a
room leaving action that in turn invokes a lower level sub-MDP controller to execute
a sequence of primitive actions that cause the agent to leave the room.
The steps of the HEXQ algorithm are now described in greater detail.
5.1 Variable ordering heuristics
The basis of HEXQ is to discover sub-tasks that can be learnt once and then reused
multiple times in different contexts. If repetitive sub-space regions can be found in
which the agent can learn to perform tasks without reference to their wider context,
then this invariance is useful in reducing the learning effort. Skills invariant to
context need only be learnt once and then transferred and used in other contexts.
These skills can also be used in contexts unseen by the learner, providing HEXQ
with the ability to generalise.
Repetitive tasks that are performed with the highest frequency tend to be the build-
ing blocks for more complex tasks. Pressing a button is a basic skill that can be
used to dial a telephone number, switch on lights or start a microwave oven. Dialing
a telephone number is a repetitive skill that in turn can be used to call a taxi, book
a flight, talk to friends, etc. To decompose a multi-dimensional state space it makes
sense to order the variables by frequency of change. In this way the variables that
change less frequently can provide a context for the more frequently changing ones.
The telephone exchange will not change state for the duration of the hand-eye co-
ordination task in pushing the button. The booking office phone will not ring until
all the numbers have been dialled. The flight will stay in an un-booked state until
the telephone call is complete.
In a similar manner subroutines in computer programs that execute most fre-
quently are associated with the lowest levels of execution. Variables that change
value more frequently should be associated with the lower levels of the hierarchy.
Conversely, variables that change less frequently set the context for the more fre-
quently changing ones. This is the intuition behind the heuristic to order variables3.
While it is not a requirement that variables must change values at different time
scales for the original MDP, it is a characteristic required to produce useful HEXQ
decompositions.

3 Future research that dispenses with this heuristic is proposed in section 9.3.
When searching for invariance, variables that remain constant for longer periods
of time are likely to set a more durative context for the faster changing variables.
Hence sub-space regions are formed first in HEXQ by variables that change more
frequently.
The first step is to order the variables by frequency of change. To order the
variables, the agent is allowed to explore its environment, at random, for a period
of time and statistics are kept on the frequency of change of each of the state
variables. The appropriate amount of exploration is user determined at this stage.
The variables are then sorted based on their frequency of change. The hierarchy
is constructed from the bottom up starting with the variable that changes most
frequently. Line 1 in the HEXQ algorithm in table 5.1 orders the variables.
For the simple maze example (figure 5.1), table 5.2 shows the frequency that
each variable changed value during a 2000 random action exploration run. The
position in room variable changes value more frequently than the room number
variable. The order in which the variables are therefore processed to build the
hierarchy is: X1 = position-in-room and then X2 = room-number.
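The ordering step can be sketched as follows (the env.reset/env.step interface returning a state tuple is an assumption for illustration, not the thesis code): the agent acts at random, counts how often each state variable changes value, and sorts the variables by that count:

    import random

    # A minimal sketch of the variable ordering heuristic (line 1 of the HEXQ
    # algorithm): count how often each state variable changes value during random
    # exploration and return the variable indices, most frequently changing first.

    def order_variables(env, actions, steps=2000, seed=0):
        rng = random.Random(seed)
        state = env.reset()
        counts = [0] * len(state)
        for _ in range(steps):
            next_state = env.step(rng.choice(actions))
            for i, (old, new) in enumerate(zip(state, next_state)):
                if old != new:
                    counts[i] += 1
            state = next_state
        return sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)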
HEXQ numbers the levels in the hierarchy from the bottom starting at level 1.
Lines 2 and 3 of the algorithm initialise the variable at level 1 to be the variable
that changes most frequently and the actions to be the primitive actions. By only
considering the values of one variable, all the states of the original MDP are ef-
fectively projected onto this variable. These projected states are highly aliased in that the values of all the other variables are not specifically referenced. For the maze in figure 5.1, S1 = {0, 1, 2, . . . , 24} and A1 = {North, South, East, West}.
Table 5.2: Frequency of change for the rooms example variables over 2000 random steps.

Variable            Frequency   Order
Room number              21         2
Position in Room       1631         1
5.2 Finding Markov Equivalent Regions
The aim is to find connected regions for which the transition and reward functions
for the projected state space are invariant in the context of all the less frequently
changing variables.
In line 5 the HEXQ algorithm explores the state and reward transitions for the
projected variable S1 = X1, initially for level one, but in general for level e, by taking
random actions. The amount of exploration is provided by the user and specified
as the number of times all projected state action pairs are executed. The amount
of exploration needs to be adequate to allow HEXQ to find all exits. The amount
of exploration is currently found by trial and error (but see section 9.1 on how this
may be developed in the future). Statistics on all projected state transitions and
rewards are recorded.
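A minimal sketch of the bookkeeping in this exploration step (the dictionary layout is an illustrative assumption): transition counts and summed rewards for the projected variable are recorded so that transition probabilities and expected rewards can be estimated later:

    from collections import defaultdict

    # A minimal sketch of the statistics gathered in line 5 of the HEXQ algorithm
    # for the projected variable: frequency counts for (s, a, s') transitions and
    # summed rewards, from which T and R estimates are formed.

    counts = defaultdict(int)     # counts[(s, a, s_next)]
    totals = defaultdict(int)     # totals[(s, a)]
    rewards = defaultdict(float)  # rewards[(s, a, s_next)], summed

    def record(s, a, r, s_next):
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
        rewards[(s, a, s_next)] += r

    def estimate(s, a, s_next):
        # Estimated transition probability and expected reward for (s, a, s_next).
        n = counts[(s, a, s_next)]
        t = totals[(s, a)]
        p = n / t if t else 0.0
        r = rewards[(s, a, s_next)] / n if n else 0.0
        return p, r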
Line 6 partitions the projected state space in accordance with the HEXQ parti-
tion definition 4.2. For the maze, the projected state space and transitions are shown
in figure 5.2. Four of the transitions are exits because they have unpredictable out-
comes from the point of view of the position-in-room variable. The transitions are
unpredictable for multiple reasons: the problem may terminate, the room-number
may change or the transition is not stationary.
To see how the partitioning process relates to Chapter 4, the global states are
defined as (se, y) where se is the projected state and y = xe+1×xe+2 . . .×xd. Initially
e = 1 and se = x1.
Figure 5.2: The Markov Equivalent region for the maze example showing the four possible exits.
To find the Markov equivalent regions (MERs), an action labelled directed graph
is drawn in which the nodes are the state values of variable se only and the edges are
the non-exit transitions that have some probability of occurring. Edges associated
with exits are not represented in the graph and strongly connected components in
the resultant graph are used to form the MERs.
The one MER for the maze example in figure 5.1 (a) is shown in figure 5.2 as a
directed graph. The exits indicate transitions for which the room-number variable
may change value. They are:
(s1 = 2, a1 = move-north)
(s1 = 10, a1 = move-west)
(s1 = 14, a1 = move-east)
(s1 = 22, a1 = move-south)
The states of the graph are strongly connected. The agent can be initialized
at any position-in-room value, hence all states are entry states. It is easy to verify
visually that all entry states can reach all exit states in this region.
Exits must be discovered to allow MERs to be formed. The algorithmic details
of how exits are discovered in general will be discussed next.
5.2.1 Discovering Exits
The definition of an exit is developed in section 4.1. Viewed from the perspective
of the projected state variable se, a state action pair is an exit if:
1. The MDP terminates.
2. The context changes (i.e. any higher level variable changes value).
3. The transition is between different Markov equivalent regions.
4. The transition function is not stationary.
5. The reward function is not stationary.
6. if (se, a) is an exit then it is an exit for all y.
Termination or Context Change
Exits are easily identified when an MDP terminates or the context (y variable value) changes. In any transition from $s^e_i$ to $s^e_j$ following action $a$, if any of the variables $x_{e+1}, \ldots, x_d$ change value or the MDP terminates, then $(s^e, a)$ is an exit.
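A minimal sketch of this check (argument names are illustrative): after each observed transition the higher level variable values are compared, and the projected state-action pair is flagged as an exit if any of them changed or the MDP terminated:

    # A minimal sketch of exit detection by termination or context change: a
    # projected (state, action) pair becomes an exit if the MDP terminates or any
    # higher level variable changes value on the observed transition.

    exits = set()

    def check_context_exit(s_proj, action, higher_before, higher_after, terminated):
        # higher_before / higher_after are the tuples of higher level variable
        # values (x_{e+1}, ..., x_d) observed before and after the transition.
        if terminated or higher_before != higher_after:
            exits.add((s_proj, action))

    if __name__ == "__main__":
        # Moving north from position-in-room 2 changes the room number: an exit.
        check_context_exit(2, "North", higher_before=(0,), higher_after=(2,),
                           terminated=False)
        print(exits)   # {(2, 'North')}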
Non-stationary Transitions
If transitions or reward functions vary for different values of y then they are variant
and associated exits are declared. By focusing on variable se the state space of
the total MDP has effectively been projected onto this variable. Transitions in this
projection are highly aliased in that they represent transitions for any value of the
y variable. It is possible that the probability of transition or the expected reward is
conditionally dependent on the y variable values. If this happens then the projected
transition will not be stationary (invariant of the y value) and will give rise to an
exit. Exits can be discovered by testing the transition and reward probabilities in
the context of different y variable values.
To illustrate how this may arise, figure 5.3 shows the four room example in which
there is a corner wall obstacle placed in room 0 and the reward on transition from
state (2, 0) to state (3, 0) is −10 instead of the usual −1. The probability of some
transitions are now no longer stationary from the perspective of the position-in-room
variable. A procedure is required for finding these non-stationary transitions.
For deterministic transitions it is easy to determine the non-stationary transitions
for projected states. It is only necessary to find a transition to two different next
states or with two different reward values to trigger an exit condition. In the example
in figure 5.3 the transition from state (2, 0) to state (3, 0) in room 0 has a reward
of −10 in contrast to the reward received for transitions from (2, 1) to (3, 1) or
(2, 2) to (3, 2) in rooms 1 and 2 respectively which have an associated reward of −1.
Therefore the transition from state s1 = 2 to s1 = 3 is not deterministic with respect to reward and (s1 = 2, action = East) becomes an exit. In the case of a transition from s1 = 7 to s1 = 8 the transition fails in room 0 but not in the other rooms. Again this transition is not deterministic. The pair (s1 = 7, action = East) is recorded as an exit.

Figure 5.3: The maze example with a corner obstacle and an expensive transition in room 0 giving rise to non-stationary transitions from the perspective of the location-in-room variable.
In the stochastic case testing for non-stationary transitions or rewards is more
involved. It is necessary to keep transition statistics and then test the hypothesis
that the reward and state transitions come from different probability distributions
for different values of the y variable. In theory it is possible to determine this to any
degree of accuracy given that it is possible to control the sample size and test all the
individual contexts represented by the y variable value. In practice this can become
intractable because the combinations of higher level variables and hence contexts
can grow exponentially.
Instead, a heuristic is employed to find non-stationary state transitions. The
transition statistics are recorded over a shorter period of time and compared to
their long term average. The objective is to explicitly test whether the probability
distribution is stationary. From the total sample space of each transition from state $s^e_i$ following action $a$, the probability $p$ of reaching state $s^e_j$ is calculated as $p = \text{frequency}(s^e_i, a, s^e_j) / \text{frequency}(s^e_i, a)$. The outcome of a sample of temporally close transitions of the same type, $(s^e_i, a, s^e_j)$, is recorded. Using a binomial distribution4 based on the average probability $p$, the likelihood that this temporally close sample has a different probability can be tested (see section A.3). When this is the case, $(s^e_i, a)$ is declared an exit. This test is performed multiple times in line 5 of the HEXQ algorithm for each type of transition as new temporally close samples become available. The procedure is to count the number $n$ of times $s^e_i$ transitions to $s^e_j$ on action $a$ in the temporal sample of $N$. If $n < pN$ then the level of significance $\alpha$ is

$$\alpha = \sum_{i=0}^{n} \frac{N!}{i!(N-i)!} \, p^i (1-p)^{N-i}. \qquad (5.1)$$
If n ≥ pN then the sum is taken from n to N . The significance level was set at
0.0001% with N = 17 to trigger exits. In practice this test may falsely identify some transitions as exits and miss others. The penalty for
recognizing extra exits is simply to generate some additional overhead for HEXQ
as these exits may require new policies to be learnt and will require additional
exploration for the extra abstract actions in the next level up the hierarchy. This
does not detract from the quality of the solution in terms of optimality.
If important exits are missed, the solution may be of poorer quality or in the
worst case, fail altogether. Chapter 9 makes a number of suggestions to improve
exit discovery. At this stage the algorithm relies on an adequate exploration period
for this statistical test to find all relevant exits.
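A minimal sketch of the significance test of equation 5.1 (the significance level and sample size are the ones quoted in the text; the function names are illustrative, not the thesis code):

    from math import comb

    # A minimal sketch of the binomial test used to flag non-stationary
    # transitions: the count n in a temporally close sample of size N is compared
    # against the long-run probability p, and an exit is declared if the tail
    # probability falls below the significance level.

    def binomial_tail(n, N, p, lower_tail):
        # P(X <= n) if lower_tail else P(X >= n) for X ~ Binomial(N, p).
        ks = range(0, n + 1) if lower_tail else range(n, N + 1)
        return sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in ks)

    def is_non_stationary(n, N, p, alpha=1e-6):
        # Significance level of 0.0001% (= 1e-6); N = 17 was used in the text.
        return binomial_tail(n, N, p, lower_tail=(n < p * N)) < alpha

    if __name__ == "__main__":
        # Long-run probability 0.9, but only 2 successes in a close sample of 17.
        print(is_non_stationary(2, 17, 0.9))   # True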
Another test is required to detect reward function non-stationarity. Unlike states,
rewards are continuous scalar values. To test the hypothesis that the rewards for
a transition from $s^e_i$ to $s^e_j$, following action $a$, are from the same distribution, the Kolmogorov-Smirnov (K-S) test (see A.3) is applied to two consecutive temporal samples of rewards. If the test indicates that the transition rewards in HEXQ come from different distributions, the pair $(s^e_i, a)$ is declared an exit.

4 The Chi-squared test has been used in some versions of HEXQ with similar intent.
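A minimal sketch of this reward test using SciPy's two-sample Kolmogorov-Smirnov test (the significance threshold below is an assumed illustrative value, not one quoted in the text):

    from scipy.stats import ks_2samp

    # A minimal sketch of the reward non-stationarity check: two consecutive
    # temporal samples of rewards for the same (s, a, s') transition are compared
    # with a two-sample K-S test; a small p-value suggests the rewards come from
    # different distributions, so the transition is flagged as an exit.

    def rewards_non_stationary(sample_1, sample_2, alpha=1e-4):
        result = ks_2samp(sample_1, sample_2)
        return result.pvalue < alpha

    if __name__ == "__main__":
        usual = [-1.0] * 20      # rewards for this transition seen in rooms 1 and 2
        in_room0 = [-10.0] * 20  # rewards for the same transition seen in room 0
        print(rewards_non_stationary(usual, in_room0))   # True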
5.2.2 Forming Regions
A HEXQ partition requires that in every Markov equivalent region (MER) an agent
must be able to reach any exit state from any entry state, with probability one,
without executing exits. To ensure that this is possible, the HEXQ algorithm in line
6 first finds the strongly connected components (SCCs) of the underlying projected
state transition graph with the nodes connected only by non-exit transitions. SCCs
have the property that all nodes can be reached with probability one. SCCs are
combined into MERs if all resultant region entry states can reach all region exit
states. The pseudo code for finding MERs is provided in table 5.3 and uses the
algorithm to find strongly connected components from section A.2.
The linear time algorithm to find SCCs requires an adjacency matrix specifying
the directed edges between nodes (table 5.3 lines 1-3).
Definition 5.1 State s′ is adjacent to state s, written adj[s][s′] and an edge is
drawn from node s to node s′ if there exists an action a such that the probability of
transition from s to s′ following action a is greater than zero and the transition is
not an exit, i.e.
∃a T ass′ > 0 and (s,a) is not an exit =⇒ adj[s][s′] (5.2)
Under some stochastic conditions it is not possible to ensure that any exit can
be reached with probability one in a multi-state region such as a room. The random
nature of the transitions may make it possible that an agent will drift through an
unintended exit. To address this situation, the function Regions keeps partitioning
the MERs, right down to single state regions if necessary, to ensure the reachability
Table 5.3: Function Regions finds all the Markov Equivalent Regions (MERs) at level e given a directed graph for a state space Se, and the Exits(Se) and Entries(Se) sets.

function Regions( states Se, actions Ae, T^a_{ss'}, Exits(Se), Entries(Se) )
    // Find SCCs so that all states are reachable from all others
 1. repeat until the number of SCCs does not increase
 2.     for each s, s' ∈ Se and a ∈ Ae (where s transitions to s' on action a)
 3.         if (T^a_{ss'} > 0 and (s, a) ∉ Exits(Se)) adj[s][s'] ← true
 4.         else adj[s][s'] ← false
 5.     SCC( states Se, adj[s][s'] )
 6.     for each s, a, s' connecting two SCCs
 7.         adj[s][s'] ← false
 8.         Exits(Se) ← Exits(Se) ∪ {(s, a)}
 9.         Entries(Se) ← Entries(Se) ∪ {s'}
10. end repeat
    // Join SCCs into regions so that all region entry states can reach all exit states
11. for each s, a, s' connecting two SCCs
12.     if (SCC[s] has no other exit or SCC[s'] has no other entry)
13.         Exits(Se) ← Exits(Se) − {(s, a)}
14.         Entries(Se) ← Entries(Se) − {s'}
15.         for each si if (SCC[si] = SCC[s'])
16.             SCC[si] ← SCC[s]
17. return |SCCs|, SCC[·], Exits(Se), Entries(Se)
end Regions
Figure 5.4: All actions in this example are assumed to have some probability of transitioning to adjacent states. Part (a) illustrates two such actions near doorways. Function Regions breaks a room iteratively into single state MERs. The results of the first three iterations are shown as parts (b), (c) and (d).
condition of each region is met. Figure 5.4 illustrates the effect of function Regions
when the room navigation actions have some probability of transitioning to any
adjacent state on any action. Under these circumstances function Regions breaks
the room into individual state MERs which passes the exit issue to higher levels in
the hierarchy to resolve, effectively reverting to the solution of the original MDP.
This type of partitioning is achieved by repeatedly calling lines 1-10 in function
Regions until no additional SCCs are generated, ensuring that all exits are proper
in a stochastic setting. The reasoning is as follows. After each function call to SCC,
the strongly connected components of any directed graph form a directed acyclic
graph (DAG) in which the nodes themselves are the components. The state and
action associated with each edge leaving a component becomes an additional exit
and the node associated with the entering edge becomes an additional entry. This
ensures that inter-block transitions give rise to exits, in other words, that transitions
between different MERs are exits. The edges associated with these new exits are
removed from the adjacency matrix of the underlying directed graph. Removing
links in the adjacency matrix may result in additional SCCs. The interdependence
of the exits and HEXQ partitions means that the algorithm to find SCCs must be
rerun every time new exits are introduced as the adjacency matrix is changed.
The process stops when each state in a SCC can be reached from any other state
with a probability greater than zero without being forced to exit. When a MER
consists of only one state, this state is both an entry and an exit state and the
condition is met trivially. A policy can therefore be created that can reach any state
with probability one without leaving the SCC. With deterministic actions, only one
pass is required to find SCCs as the removal of edges between SCCs will not remove
any edges within SCCs (as may be the case for stochastic actions).
The SCCs found in lines 1-10 of function Regions could be equated to MERs
that are later abstracted to form higher level states. For an efficient decomposition
the aim is to maximize the size of the MERs in order to minimize the number of
abstract states. SCCs over-specify the partition requirements. The only requirement
is that all exit states be reachable from all entry states, not that all states can reach
each other.
It may be possible to combine some SCCs to form larger MERs. The SCCs
still form a DAG when the edges removed in lines 6 to 9 of function Regions are
reinstated. Two connected SCCs with edges say from scci to sccj can be joined if
scci has no other exits or sccj has no other entering edges. In lines 11-16 of function
Regions, table 5.3, SCCs are tested for combination opportunities with the proviso
that they do not violate exit reachability requirements. The Regions algorithm
returns the MERs as MERe = SCC[·] in line 17. A MER, therefore, is a single
SCC or a combination of SCCs such that any exit state in the region can be reached
from any entry with probability 1. When combining SCCs to form regions a mixture
of MERs and SCCs can be combined in the same way.
Figure 5.5 illustrates the combination process. It shows four SCCs labelled 0 to
3 connected as a DAG with external entry and exit edges. SCC scc0 and scc1 can
be joined to form MER0 because scc1 has no other entries other than the one from
scc0.

Figure 5.5: Four SCCs joined to form two MERs.

SCC scc2 can be joined with scc3 to form MER1 because scc2 has no other
exit edges besides the one to scc3. SCC scc0 cannot be joined to scc2, for example,
because scc0 has another exit edge (to scc1) and scc2 has another entry edge from
outside. MER0 cannot be combined with MER1 for similar reasons.
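A minimal sketch of the SCC step (using the networkx library for the strongly connected components; the data layout is an illustrative assumption, and the iterative re-partitioning and joining of SCCs performed by function Regions is not repeated here):

    import networkx as nx

    # A minimal sketch of the first stage of function Regions: build the directed
    # graph whose edges are non-exit transitions with non-zero probability
    # (definition 5.1) and find its strongly connected components; each SCC is a
    # candidate building block for a MER.

    def candidate_regions(states, transitions, exits):
        # transitions: iterable of (s, a, s_next) tuples with positive probability;
        # exits: set of (s, a) pairs already flagged as exits.
        g = nx.DiGraph()
        g.add_nodes_from(states)
        for s, a, s_next in transitions:
            if (s, a) not in exits:
                g.add_edge(s, s_next)
        return sorted(sorted(c) for c in nx.strongly_connected_components(g))

    if __name__ == "__main__":
        # A tiny 4-state example in which one exit action cuts the space in two.
        states = [0, 1, 2, 3]
        transitions = [(0, "a", 1), (1, "a", 0), (1, "b", 2), (2, "a", 3), (3, "a", 2)]
        exits = {(1, "b")}
        print(candidate_regions(states, transitions, exits))   # [[0, 1], [2, 3]]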
The repeated call on the SCC function means that in the worst case finding
MERs now takes $O((|S^e| + |A^e|)^2)$ time. This in itself is not an issue, as the states involved are only those abstracted at level e, unless the number of states grows exponentially because the overall problem will not decompose efficiently. The
MERs are arbitrarily labelled for convenience with consecutive integers starting from
0. This completes line 6 in function HEXQ in table 5.1.
5.3 Creating and Solving Region Sub-MDPs
Now that MERs have been found that are invariant in the context of all other
combination of higher level variable values, the HEXQ algorithm proceeds to find a
set of policies for these regions (lines 7 to 9). The only motivation an agent can have
in a region is to exit, as all non-exit transitions have a negative reward. One policy
is found to reach each exit in the MER. A policy to reach an exit state followed by an exit action comprises an exit from the MER. The policies and their exit actions
are the abstract actions that are available at the next higher level in the hierarchy.
The rest of this section will explain how the cache of exit policies for each MER is
found by creating and solving sub-MDPs for each of the exits in each of the MERs.
One sub-MDP could be defined for each exit as described in section 4.2.2. How-
ever, if two or more exits have the same set of hierarchically nested exit states,
then it is not necessary to create multiple sub-MDPs because once they are reached,
the different exits can be executed simply by executing their different primitive exit
actions. It may not be sufficient to just have one sub-MDP for each abstract exit
state. The issue only arises when there are more than two levels in the hierarchy.
Consider, for example, the nested set of abstract states in figure 5.6. The shaded
region in the diagram represents a MER at level 2. It contains 4 abstract level 2
states, S^2_0, S^2_1, S^2_2 and S^2_3. Abstract state S^2_1 is an aggregate of level 1 states S^1_0, S^1_1 and S^1_2. There are two exits from the level 2 MER, both from exit state S^2_1. The value function to exit the MER will be different for each of these two exits because one of the exits requires an additional internal transition. Therefore two sub-MDPs are required because the set of states S^1_0 and S^2_1 differs from the set S^1_2 and S^2_1 in the hierarchy.
Definition 5.2 (Hierarchical State) For an MDP with state x ∈ X, the hierar-
chical state s for x at level e is the ordered sequence (s1, s2, . . . , se), where si ∈ Si
from step 11 in the HEXQ algorithm in table 5.1. Note the level e to which hierar-
chical state s is defined is always implied by the context in which it is used.
Abstract state si can be determined from x as will be seen later by equation
5.5. A hierarchical exit state at any level is the hierarchical state associated with
an exit. It is the sequence of states in an exit. The hierarchical exit state for the
exit (se, (se−1, (. . . (s1, a1) . . .))) at level e is (s1, s2, . . . , se). When hierarchical exit
states are the same for different exits, HEXQ only needs to construct one sub-MDP
for these exits.
Figure 5.6: Two exits of a level 2 MER that require 2 separate level 2 sub-MDPs even though both exits have the same level 2 exit state, S^2_1.
The construction of sub-MDPs (line 7 of the HEXQ algorithm) proceeds as out-
lined in section 4.2.2. One sub-MDP is created for each hierarchical exit state of each
MER. Recall that exits from other hierarchical exit states are not allowed by con-
struction. This ensures that the policy cache contains policies that are guaranteed
to exit from only one exit state.
HEXQ has already gathered considerable statistics, modelling the state transi-
tion and reward functions in the previous partitioning steps. It is therefore efficient
to solve the sub-MDPs using dynamic programming. A dynamic programming al-
gorithm for solving the sub-MDPs is given in table 2.1 (see footnote 5).
The hierarchically optimal policy for these sub-MDPs can also be learnt or im-
proved by using temporal difference methods such as Q learning concurrently with
exploration at the next level. It is important to remember to restrict actions for
other exit states to non-exit actions during learning, even for exploration, as un-
intended exits at any level will mislead the learner. The HEXQ implementation continues to update the E functions at all levels using Q-learning as shown later in section 5.6. The dynamic programming solution is based on the sample transitions that are explored and may only have provided an approximation to the true transition probability distributions. The benefit of continuous updating is to refine the value function and policy.

5. The standard process is only complicated by the need to recursively calculate the value of the post-transition states s′ from the decomposed value function as per equation 4.8 and described algorithmically in subsection 5.4. This means that the value function in step 11 of table 2.1 becomes V (s′) ← value(e, sub-MDP m, s′). The rewards R^a_ss′ in table 2.1 are interpreted as the primitive rewards on exit and the returned function Q∗(s, a) is E∗m(se, ae).
In the simple maze example there is one MER and four exit states as illustrated
in figure 5.2. Consequently four sub-MDPs are created, one for reaching each of the
exit states. These sub-MDPs are as illustrated in the bottom of figure 5.1. This
completes lines 8 & 9 of the HEXQ algorithm in table 5.1.
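To make the construction concrete, the fragment below sketches value iteration for a single region sub-MDP in Python. The dictionaries P and R stand for the transition and reward statistics gathered during exploration, and the handling of the decomposed value of post-transition states (footnote 5) is omitted; the function is an illustration under these assumptions rather than the thesis code.

def solve_sub_mdp(states, actions, P, R, exit_state, exit_action, sweeps=1000, tol=1e-6):
    # Value iteration (gamma = 1) for a stochastic shortest path sub-MDP whose
    # goal is to leave the region through (exit_state, exit_action).
    #   P[(s, a)] : dict mapping next states to estimated probabilities
    #   R[(s, a)] : estimated expected primitive reward for taking a in s
    # Actions that would exit through a different exit state are assumed to
    # have been removed from the action set beforehand.
    # Returns the action-value table E[(s, a)].
    E = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        delta = 0.0
        for s in states:
            for a in actions:
                if s == exit_state and a == exit_action:
                    target = R[(s, a)]          # taking the exit ends this sub-task
                else:
                    target = R[(s, a)] + sum(
                        p * max(E[(s2, a2)] for a2 in actions)
                        for s2, p in P[(s, a)].items())
                delta = max(delta, abs(target - E[(s, a)]))
                E[(s, a)] = target
        if delta < tol:
            break
    return E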
5.4 Hierarchical State Value
HEXQ requires the optimal value of a hierarchical state to be reconstructed from
the decomposition equations 4.9 and 4.10. The pseudo code for this procedure for
hierarchical state s at level e is given in table 5.4 and is similar to EvaluateMaxNode
(Dietterich, 2000).
The hierarchical state value is required by HEXQ in line 9 in table 5.1 and
to update the HEXQ function E during temporal difference learning. When the
hierarchical state is the current state of the agent, the next optimal (greedy) action
to take at all levels in the hierarchy is calculated as a byproduct of this procedure.
Function value implements a depth-first-search as in MAXQ. A depth-first-search
can become expensive (Dietterich, 2000), as the branching on abstract actions means that the number of paths to search grows exponentially with the number of levels.
Chapter 8 addresses ways to contain the search.
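A minimal Python rendering of this recursion, mirroring function value in table 5.4 below, is sketched here; the lookups actions_in, sub_mdp_of and the decomposed E tables are assumed to be supplied by the surrounding HEXQ implementation and are hypothetical names.

def value(e, m, s, E, actions_in, sub_mdp_of):
    # Return (optimal value of hierarchical state s in sub-MDP m at level e,
    # best greedy action at level e), following table 5.4.
    #   s                : hierarchical state, a tuple (s1, s2, ..., se)
    #   E[m][(se, a)]    : decomposed action-value table for sub-MDP m
    #   actions_in(m)    : (abstract) actions available in sub-MDP m
    #   sub_mdp_of(a, l) : sub-MDP implementing abstract action a at level l
    if e == 0:
        return 0.0, None
    best_v, best_a = float("-inf"), None
    se = s[e - 1]                                # projected state at level e
    for a in actions_in(m):
        if e == 1:
            v_below = 0.0                        # primitive actions have no inner value
        else:
            m_below = sub_mdp_of(a, e - 1)
            v_below, _ = value(e - 1, m_below, s, E, actions_in, sub_mdp_of)
        v = v_below + E[m][(se, a)]
        if v > best_v:
            best_v, best_a = v, a
    return best_v, best_a

The exponential cost mentioned above shows up here as the loop over abstract actions repeated at every level of the recursion.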
5.5 State and Action Abstraction
Once a policy cache has been established for each MER it is possible to reformulate
and reduce the original MDP by eliminating a variable. In the maze example, once
Table 5.4: Procedure for evaluating the optimal value of a hierarchical state in a HEXQ graph. The function returns the optimal value of the hierarchical state s based on the optimal policy for each sub-MDP and finds the best greedy action at every level up to e
function value( level e; sub-MDP m; hierarchical state s )
1. if e = 0 return 0
2. A ← actions available in m
3. v ← −∞
4. ae∗ ← undefined (best greedy action at level e)
5. for each a ∈ A
6. me−1 ← sub-MDP associated with a at level e− 1
7. v′ ← value(e− 1, me−1, s)+E∗m(se, a)
8. if (v′ > v)
9. v ← v′
10. ae∗ ← a
11. return v, ae∗
end value
room leaving policies are established, the position-in-room variable can be eliminated
and a reduced semi-MDP defined using the room-number as the state and the room
leaving policies as actions, albeit temporally extended. This was shown in figure
4.12 in Chapter 4.
When there is more than one MER at the next level, projected states may consist
of all possible combinations of MER types and values of the next variable in the
frequency order. Lines 10 and 11 of the HEXQ algorithm prepare the abstract states
and actions for the semi-MDP at the next level.
Abstract actions at the next higher level are policies for exiting regions at the
current level. They are the optimal policies to exit the sub-MDPs. The notation
that is used to describe an abstract action at level e + 1 is (se, ae) where se is an
exit state and ae is an (abstract) exit action at level e. This is the same notation as
that used to describe exits at level e. There is a 1 : 1 correspondence between all
the exits at one level and the abstract actions at the level above. Given the set of
MERs, MERe, at level e, the set of all abstract actions Ae+1 at level e + 1 is given
by
Ae+1 = ∪i∈MERe Exits(i).    (5.3)
Taking abstract action (se, ae) at level e + 1 means that the agent is required
to use the optimal policy for the right sub-MDP at level e to move to exit state se
and then execute action ae. In the maze example the set of four abstract actions is
A2 = { (s1 = 2, a = move-north), (s1 = 10, a = move-west), (s1 = 14, a = move-
east), (s1 = 22, a = move-south) }. The taxi example of section 6.1.2 illustrates a
case where the number of abstract actions is greater than the number of sub-MDPs
required.
Abstract projected states at the next level Se+1 are formed by the cartesian
product of the set of MER labels for the current level e and the set of values for the
next state variable in the ordering Xe+1.
Se+1 = MERe ×Xe+1 (5.4)
For convenience, to easily index arrays in the code, the numerical method used to determine an abstract variable value se+1 for state (se, xe+1 × . . . × xd) is

se+1 = |MERe| · xe+1 + me    (5.5)

where |MERe| is the number of MERs at level e and me is the MER label for level e (see section 5.2.2) such that se ∈ MER.
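As a small worked illustration of equation 5.5, the abstract value can be computed with one line of integer arithmetic; the helper below is hypothetical, not part of the thesis code.

def abstract_state(num_mers, mer_label, x_next):
    # Equation 5.5: index of the abstract state at level e+1 formed from the
    # MER containing the current level-e state (mer_label) and the value of
    # the next variable in the frequency ordering (x_next).
    return num_mers * x_next + mer_label

# Taxi example at level 2: there is a single MER labelled 0 at level 1, so the
# abstract state index simply equals the passenger location value.
assert abstract_state(num_mers=1, mer_label=0, x_next=3) == 3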
It is generally the case that different abstract states have different abstract ac-
tions. The semi-MDP that is defined for the next level has one less variable than
the previous one and uses only abstract actions. The procedure (lines 5 to 11 in
function HEXQ table 5.1) is repeated, finding MERs and exits using the abstract
states and actions at the next level. In this way, one level of hierarchy is generated
for each variable in the original MDP. When the last state variable is reached, the
top level sub-MDP, represented by the final abstract states and actions, solves the
overall MDP.
5.6 HEXQ Execution
The algorithm for executing HEXQ at any level, including the top level, is given
in table 5.5. It executes the HEXQ hierarchy, recursively calling itself as it invokes
lower level policies. At each level it remembers the abstract state. On return,
following the execution of an abstract or primitive action, it updates the E value
using Q-Learning as per table 2.2.
Hence, the final step in the HEXQ algorithm is to Execute(top level, starting
state, 0). There is only one sub-MDP at the top level that is invoked with abstract
action 0. When function Execute returns, the problem has terminated and can be
restarted by re-initialising the problem according to the starting state distribution
X0 and calling Execute again6.
5.7 Efficiency Improvements
There are a number of ways that HEXQ can be improved. The following two ideas
have not been implemented at this time.
6. Some refinements have not been included in the pseudo code. The exit for the top level sub-MDP may have reward on termination, depending on the problem. A condition to stop the learning process could be included.
Table 5.5: Function Execute solves sub-MDP m associated with abstract action a at level e. The state s is the current hierarchical state at each level depending on the context in which it is used. lse is the last projected state at level e. The learning rate β is set to a constant value. The E tables are originally initialised to 0.
function Execute(level e, state s, action a)
1. if (e = 0)
2. execute primitive action a and observe s and reward
3. return
4. m ← sub-MDP associated with action a
5. repeat
6. act ← value( level e, sub-MDP m, state s) or exploration policy
7. lse ← abstraction of s at level e
8. Execute(level e− 1, state s, action act)
9. if((s, act)=exit) return
10. else
11. V ′ ← value( level e, sub-MDP m, state s)
12. Em(lse, act) = (1− β)Em(lse, act) + β(reward + V ′)
end Execute
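The control flow of function Execute translates directly into Python. In the sketch below, env, value, sub_mdp_of and is_exit are assumed helpers standing in for the surrounding HEXQ machinery (value is the hierarchical valuation of table 5.4, returning a value and a greedy action); only the recursion and the one-step E update are intended to follow table 5.5.

BETA = 0.25   # constant learning rate (0.25 in the taxi experiments)

def execute(e, s, a, env, E, value, sub_mdp_of, is_exit):
    # Recursively execute (abstract) action a at level e from hierarchical
    # state s.  Returns the resulting hierarchical state and the last
    # primitive reward observed.
    if e == 0:
        return env.step(a)                       # primitive action: (next state, reward)
    m = sub_mdp_of(a, e)                         # sub-MDP implementing abstract action a
    while True:
        _, act = value(e, m, s)                  # greedy action (or an exploration policy)
        ls_e = s[e - 1]                          # remember the projected state at level e
        s, reward = execute(e - 1, s, act, env, E, value, sub_mdp_of, is_exit)
        if is_exit(e, ls_e, act):                # this region has been exited
            return s, reward
        v_next, _ = value(e, m, s)               # one-step Q-learning style update of E
        E[m][(ls_e, act)] = (1 - BETA) * E[m][(ls_e, act)] + BETA * (reward + v_next)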
5.7.1 No-effect Actions
If at level e an action a always leaves the hierarchical state s unchanged then no
HEXQ value function entries E(se, a) for this state-action pair need be stored. For stochastic shortest path problems, since the rewards are always negative, a hierarchi-
cal state transition to itself with probability one will never be a part of an optimal
policy. This action can therefore be safely eliminated from being available in such
a state. The savings in storage and learning time can be significant, as HEXQ will
otherwise explore and store values for all allowable actions.
If the action causes an exit in some states then exits are still noted, but the value
function E is by definition zero and does not need to be stored for any exit actions,
except at the top level if there is a final reward on exit.
5.7.2 Combining Levels
The HEXQ algorithm described in this chapter builds one level of hierarchy for each
variable in the state description. It may be the case that a sub-set of variables are so
interdependent that the MERs are too small to provide an effective decomposition.
When the MERs are individual states no state abstraction is possible and if all the
primitive actions remain as exits then there is no advantage in introducing another
level in the hierarchy. If this is the case, it is more efficient to combine the variables
by taking their Cartesian product directly and skipping a level in the hierarchy.
The decision to combine variables directly can be algorithmically determined if, for
example, the storage space for the action-value function E is the criterion.
Assume the states at level e are se and the next variable to be processed is xe+1.
Say the ith region has |sei| states, the number of actions per state is |aei|, the number of exits is |eei| and the number of sub-MDPs required is |mei|. The storage requirement with separate levels in the HEXQ hierarchy is:

StwoLevels = Σi (|sei| · |aei| · |mei|) + |xe+1| Σi |eei|    (5.6)

If the variables se and xe+1 are combined, then the storage requirement is:

ScombinedLevel = |xe+1| · Σi (|sei| · |aei|)    (5.7)
The decision to combine variables would be triggered when ScombinedLevel <
StwoLevels. These calculations do not include no-effect actions. An examination
of the equations shows why HEXQ has the potential to improve the space complex-
ity. In StwoLevels the numbers of values of the two state variables appear in separate terms that are added. For the combined storage the sizes of the two state spaces are multiplied.
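The criterion can be evaluated mechanically. The sketch below assumes the per-region counts are available as parallel lists and simply compares equations 5.6 and 5.7; it is illustrative only.

def should_combine(region_states, region_actions, region_sub_mdps, region_exits, x_next_size):
    # Compare the E-table storage of keeping separate levels (equation 5.6)
    # with combining the level-e variable and the next variable (equation 5.7).
    # The i-th entry of each list describes region i at level e.
    s_two_levels = (
        sum(s * a * m for s, a, m in zip(region_states, region_actions, region_sub_mdps))
        + x_next_size * sum(region_exits))
    s_combined = x_next_size * sum(s * a for s, a in zip(region_states, region_actions))
    return s_combined < s_two_levels, s_two_levels, s_combined

Section 6.1.3 gives a taxi example where this test favours combining two of the variables (ScombinedLevel = 150 against StwoLevels = 220).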
5.8 Conclusion
This chapter explained the operation of the HEXQ algorithm in detail. HEXQ
automatically decomposes and solves a stochastic shortest path multi-dimensional
MDP. It does this by eliminating one variable at a time, searching for invariant
Markov equivalent regions and collapsing the regions into abstract states. Each
region is equipped with its own cache of intra-region policies. The policies are
abstracted to form abstract actions. The abstract states and actions form a new
semi-MDP with one less variable. Repeated abstractions hierarchically abstract the
original MDP. The next chapter will test HEXQ in a variety of different domains.
Chapter 6
Empirical Evaluation
This chapter will show how HEXQ performs in a number of different domains. The
first one is the taxi task. This example was created by Dietterich (2000) to illustrate
MAXQ. For MAXQ the structure of the hierarchy is specified by the user. HEXQ
decomposes and solves this problem automatically. The taxi task is then varied by
• representing the state space with one more variable to show how HEXQ auto-
matically adapts to exploit any new constraints,
• forcing an order on some of the variables to demonstrate the robustness of
HEXQ,
• making the passenger “fickle” to show that HEXQ finds a different decompo-
sition to avoid an otherwise poorer solution and
• including fuel to demonstrate how HEXQ manages the hierarchical credit as-
signment problem.
The other domains covered are: a 25 rooms problem to test for region boundaries
without higher level variables changing value; the Towers of Hanoi puzzle with 7
discs demonstrating HEXQ building a hierarchy with 7 levels; and a ball kicking
task where HEXQ saves 5 orders of magnitude in storage space over a flat1 learner
making the learning task feasible.
6.1 The Taxi Domain
In the taxi domain shown in figure 6.1, a taxi started at a random location navigates
around a 5-by-5 grid world to pick up and then put down a passenger. There are four
possible source and destination locations, designated R, G, Y and B encoded 1, 2, 3,
4 respectively. The objective of the taxi agent is to go to the source location where
the passenger is waiting, pick up the passenger, and navigate with the passenger
in the taxi to the destination location. Once there the agent has to put down the
passenger to complete its mission. The source and destination locations are also
chosen at random for each new trial. At each step the taxi can perform one of six
primitive actions. There are four navigation actions: move one square to the North,
South, East or West, one action to pickup the passenger and one action to putdown
the passenger. A move into a wall or barrier leaves the taxi location unchanged. For
a successful passenger delivery the reward is 20. If the taxi executes a pickup action
at a location without the passenger or a putdown action anywhere other than at the
destination location it receives a reward of -10. For all other transitions the reward
is -1. The trial terminates following a successful delivery. The navigation actions
are stochastic, in the sense that, with an 80% probability the action is performed as
intended and with a 20% probability the agent attempts to move either to the left
or to the right of the intended direction with equal chance.
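The navigation noise can be made concrete with a short sampler. The snippet below only illustrates the 80/10/10 slip model described above; the direction encoding is a hypothetical one.

import random

# clockwise order, so "left of" and "right of" are simple index shifts
DIRECTIONS = ["North", "East", "South", "West"]

def attempted_direction(intended, rng=random):
    # Return the direction the taxi actually attempts to move in: the intended
    # direction with probability 0.8, otherwise a slip to the left or to the
    # right of the intended direction with probability 0.1 each.
    i = DIRECTIONS.index(intended)
    r = rng.random()
    if r < 0.8:
        return intended
    if r < 0.9:
        return DIRECTIONS[(i - 1) % 4]           # slip to the left of the intended direction
    return DIRECTIONS[(i + 1) % 4]               # slip to the right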
1. To distinguish a normal MDP from a hierarchically decomposed structure the former is referred to as flat.
Figure 6.1: The Taxi Domain.
6.1.1 Automatic Decomposition of the Taxi Domain
Testing HEXQ on this domain provides an example of how HEXQ decomposes a
3-dimensional problem forming a three level hierarchy. The HEXQ algorithm from
Chapter 5 is reproduced here for convenience and will be explained step by step.
The taxi problem can be formulated as an episodic MDP with the 3 state vari-
ables: the location of the taxi (values 0-24), the passenger location including being
in the taxi (values 0-4, 0 means in the taxi) and the destination location (values
1-4). It is easy to see that for the taxi to navigate to one of the source or destina-
tion locations, the navigation policy can be the same whether it intends to pick up
or put down the passenger. The usual flat formulation of the MDP will solve the
navigation subtask as many times as it reoccurs in the different contexts.
Dietterich (2000) shows how, by designing a MAXQ hierarchy, the problem is
solved more efficiently with subtask reuse. The problem will now be automatically
decomposed and solved with HEXQ. Progress through the HEXQ algorithm will be
traced in detail.
The first step in the HEXQ algorithm (table 6.1) is to order the variables by
frequency of change. For the taxi example, table 6.2 shows the frequency that each
Table 6.1: The HEXQ algorithm
function HEXQ( MDP〈states = X, actions = A〉 )
1. X ← sequence of variables 〈X1, X2, . . . , Xd〉 sorted by frequency of change
2. S1 ← X1
3. A1 ← A
4. for level e ← 1 to d− 1
5. explore Se, Ae transitions at random to find T^a_ss′, Exits(Se), Entries(Se)
6. MERe ← Regions(Se, Ae, T ass′ , Exits(Se), Entries(Se))
7. construct sub-MDPs from MERe using Exits(Se)
8. for all sub-MDPs m
9. E∗m(se, ae) ← ValueIteration(sub-MDP m, γ = 1)
10. Ae+1 = ∪i∈MEReExits(i)
11. Se+1 = MERe ×Xe+1
12. Execute(level d, initial state, top level sub-MDP)
end HEXQ
variable changed value during a 2000 random action exploration run. The duration
of 2000 steps is specified by the user. The taxi location variable changes value more
frequently than the others. The passenger location variable only changes value
when the passenger is picked up and the destination variable remains constant for
the entire trial. The order in which the variables are therefore processed to build
the hierarchy is: taxi location, passenger location and finally destination.
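The ordering step itself is simple to express. The sketch below assumes a generic environment object with an action list, a reset method and a step method that returns the next state as a tuple of variable values; it is an illustration of the frequency count, not the thesis code.

import random

def order_variables_by_frequency(env, n_steps=2000, rng=random):
    # Count how often each state variable changes value during a random walk
    # and return the variable indices sorted from most to least frequently
    # changing, together with the raw counts.
    state = env.reset()
    counts = [0] * len(state)
    for _ in range(n_steps):
        action = rng.choice(env.actions)
        next_state = env.step(action)
        for i, (old, new) in enumerate(zip(state, next_state)):
            if old != new:
                counts[i] += 1
        state = next_state
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    return order, counts

For the counts in table 6.2 this returns the taxi location first, then the passenger location, then the destination.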
Having selected the taxi location as the first variable, the algorithm performs
Table 6.2: Frequency of change for taxi domain variables over 2000 random steps.
Variable Frequency Order
Passenger location 4 2
Taxi location 846 1
Destination 0 3
the decomposition of the state space into MERs in line 6 of table 6.1. The state
space is described by (s1, y) where s1 = Taxi location and y = Passenger location × Destination. To find the directed graph representing transitions between values of the taxi location, the taxi agent again performs random actions until all primitive actions have been taken a predefined number of times (180 in this case) for each state variable value. This number is user defined for this problem and is selected in such a way that
all exits are consistently found for up to 100 runs of the experiment. The results of
random trials for ordering the variables could be reused for finding regions, but this
efficiency has not been implemented.
For each state s1 and primitive action the frequency of the next s1 state variable
values and the rewards received are recorded. These statistics have two uses. In
this decomposition step they are used to decide whether there are any non-Markov
transition exits by using the binomial and K-S tests as described in section 5.2.1.
Later they are used to solve the sub-MDPs using value iteration. For the taxi
example the directed graph for the taxi location variable is shown in Figure 6.2. Exit
transitions are not counted as edges. For example, taking action pickup or putdown
in state s1 = 23 may change the passenger location variable or reach the goal,
respectively. This means that this transition may change the y variable. A change
in the y variable value indicates an exit. However, it is possible to probabilistically
predict the transitions anywhere else in the environment. From state 7, taking
action North for example, transitions to state 2, 80% of the time, 10% of the time
to state 8 and 10% to state 7. Only one directed edge is drawn to represent multiple
transitions between the same states in Figure 6.2 to avoid cluttering the diagram.
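As a rough illustration of how these statistics are used, the fragment below flags a (state, action) pair as an exit if it was ever observed to change the higher level context y, or if its next-state frequencies vary noticeably across contexts. The thesis uses binomial and Kolmogorov-Smirnov tests for the second check (section 5.2.1); the fixed threshold here is only a stand-in for those tests.

from collections import Counter

def looks_like_exit(samples, threshold=0.1):
    # samples: list of (y_context, next_s1, y_changed) observations collected
    # for a single (s1, action) pair during random exploration.
    if any(y_changed for _, _, y_changed in samples):
        return True                              # the transition can change the context

    # compare the empirical next-state distribution within each context with
    # the pooled distribution; a large deviation suggests the transition is
    # not invariant across contexts
    pooled = Counter(s for _, s, _ in samples)
    total = sum(pooled.values())
    for context in {y for y, _, _ in samples}:
        ctx = [s for y, s, _ in samples if y == context]
        ctx_counts = Counter(ctx)
        deviation = max(abs(ctx_counts[s] / len(ctx) - pooled[s] / total) for s in pooled)
        if deviation > threshold:
            return True
    return False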
All states are entry states because the taxi agent is started at random in any
location. As the whole graph for the taxi location variable is strongly connected,
one MER is formed which meets the condition of a HEXQ partition that all entry
states must reach all exit states without leaving the region. The MER is labelled
0. It has eight exits. They are:
Figure 6.2: The directed graph of projected state transitions for the taxi location. Exits are not counted as edges of the graph.
(s1 = 0, a1 = pickup), (s1 = 0, a1 = putdown),
(s1 = 4, a1 = pickup), (s1 = 4, a1 = putdown),
(s1 = 20, a1 = pickup), (s1 = 20, a1 = putdown),
(s1 = 23, a1 = pickup), (s1 = 23, a1 = putdown).
Line 7 of the HEXQ algorithm now creates 4 sub-MDPs as shown in figure 6.3.
Each sub-MDP has a sub-goal to reach one of the exit states. The sub-MDPs are
solved using value iteration with the transition model (transition probabilities and
rewards) determined from the recorded statistics in line 5 of the HEXQ algorithm.
The solution to each sub-MDP m determines the optimum HEXQ action value func-
tion E∗m. As this is level 1, these values are identical to the Q action value function.
The value E∗m(s1, a1) is the expected maximum reward that can be accumulated on
the way to the exit state for sub-MDP m by starting in state s1, taking action a1
next and then continuing with the optimum policy.
Figure 6.3: The four level 1 sub-MDPs for the taxi domain, one for each hierarchical exit state, constructed by HEXQ.
Step 10 of the HEXQ algorithm generates abstract actions for the next level by
taking the union of all exits over all MERs. Abstract actions are encoded in the
same way as exits. The 8 abstract actions are the 8 exits listed above.
Line 11 of the HEXQ algorithm generates abstract state variable values for the
next level. Applying the method in equation 5.5 simply generates variable values
such that s2 = x2 as there is only one MER labelled 0. In other words, the abstract
state variable values at level 2 are the same as the values of the next most frequently
changing state variable, the passenger location. Therefore there are 5 abstract states
and 8 abstract actions at level 2.
At this stage HEXQ loops back to step 5 to decompose the level 2 state space.
Figure 6.4 shows the directed graph generated after a random execution of abstract
actions, exploring each action from each state 5 times (user defined). There are 5 s2
abstract state nodes. Each one represents the abstracted MER from level 1. In the
figure one of the level 2 abstract states, state 3, is illustrated with the detail of the
level 1 MER that it represents. Level 2 edges show the effect of the abstract actions.
With the passenger at a particular source location 1, 2, 3 or 4, the only abstract
action that successfully places the passenger in the taxi is the one that navigates
the taxi to passenger source location and performs a pickup. Other abstract actions
leave the passenger location unchanged. The transitions in figure 6.4 that cause a
change to state 0 are labelled with the abstract actions.
With the passenger in the taxi, s2 = 0, there are four abstract actions that cause
exits. These are the ones that navigate to one of the source/destination locations
and perform a putdown primitive action. They are shown in figure 6.4. For example
exit (s2 = 0, (s1 = 23, a1 = putdown)) means: with the passenger in the taxi,
navigate to location s1 = 23 and putdown the passenger. They are exits because
they may cause the MDP to terminate (see exit definition 4.1). The exit notation
at level 2 is (s2, a2) where a2 = (s1, a1).
Figure 6.4: State transitions for the passenger location variable at level 2 in the hierarchy. There are 4 exits at level 2.
This time the directed graph is not strongly connected. The degenerate 5 SCCs
found by the algorithm are the individual states. However, because the problem is
never started with the passenger in the taxi, state s2 = 0 is not an entry state. This
means that the SCCs are merged into one MER which complies with the reachability
condition of a HEXQ partition that all entry states can reach all exit states.
The hierarchical exit states for this MER are (s2 = 0, s1 = 0), (s2 = 0, s1 = 4),
(s2 = 0, s1 = 20), (s2 = 0, s1 = 23). From subsection 5.3, because the hierarchical
exit states are all different for the four exits at level 2, it is necessary to create 4
sub-MDPs at level 2.
Four abstract actions corresponding to the policies of the four sub-MDPs (one
for each exit) are created for level 3. The abstract states s3 generated for level 3
correspond to the destination states as there is only one MER at level 2. The HEXQ
algorithm now drops to step 11 to solve the top level sub-MDP. This sub-MDP is
trivial to solve as each abstract action either directly solves the problem or leaves
the agent in the same state, that is, with the passenger still to be delivered. The
directed graph for level 3 is shown in figure 6.5. Note the nesting in the description
of abstract actions.
The resultant HEXQ graph for the decomposed MDP is shown in figure 6.6.
There are 9 sub-MDPs in total in the hierarchy. To illustrate the execution of a
competent taxi agent on the hierarchically decomposed problem, assume that the
taxi is initially located randomly at cell 5, the passenger is on rank 4 and wants to
go to rank 3.
In the top level sub-MDP, the taxi agent perceives the passenger destination as
3 and takes abstract action a3 = (s2 = 0, a2 = (s1 = 20, a1 = putdown)). This sets
the subgoal state at level 2 to s2 = 0 or in English, pick up the passenger first. At
level 2, the taxi agent perceives the passenger location as 4, and therefore executes
abstract action (s1 = 23, a1 = pickup). This abstract action sets the subgoal state
at level 1 to taxi location s1 = 23 i.e. location 4. The level 1 policy is now executed
using primitive actions to move the taxi from location s1 = 5 to the pickup location
Figure 6.5: The top level sub-MDP for the taxi domain showing the abstract actions leading to the goal.
s1 = 23 and the pickup action is executed on exit. Level 1 returns control to level 2
where the state has transitioned to s2 = 0. Level 2 now completes its instruction and
takes abstract action (s1 = 20, a1 = putdown). This again invokes level 1 primitive
actions to move the taxi from location s1 = 23 to s1 = 20 and then putdown to exit.
Control is returned back up the hierarchy and the trial ends with the passenger
delivered correctly.
The Taxi example illustrates how HEXQ automatically decomposes a 3-dimensional
MDP without the benefit of a prior model. The construction of level 2 in the hi-
erarchy demonstrates the merging of five SCCs to form one MER. While there was
only one exit state at the same level (s2 = 0), 4 sub-MDPs are required as each
hierarchical exit state is unique. In total 9 sub-MDPs are generated to construct
the HEXQ hierarchy.
Figure 6.6: The HEXQ graph showing the hierarchical structure automatically generated by the HEXQ algorithm: one level 3 sub-MDP invoking 4 abstract actions, 4 level 2 sub-MDPs invoking 8 abstract actions, and 4 level 1 sub-MDPs using the 6 primitive actions.
6.1.2 Taxi Performance
It is possible to demonstrate performance improvements over a flat reinforcement
learner for the simple taxi task both in terms of the number of primitive time steps
required to reach optimality and in the size of the table required to store the value
function.
For the experiments, the stochastic taxi task described previously is used. The
initial value functions, Q for the flat MDP, C for MAXQ and E for HEXQ are ini-
tialised to zero. The learning rate for all algorithms was set to a constant 0.25 (see footnote 2). Each
algorithm is allowed to choose actions greedily and relies on the natural stochasticity
and properties of the domain to explore all the states. HEXQ is run as described
above, automatically constructing the HEXQ graph. To compare it to MAXQ (footnote 3) with
2. The effectiveness of convergence with this learning rate is examined in appendix B.
3. The implementation of MAXQ used for these experiments is that interpreted and programmed by the author.
a user provided decomposition, HEXQ is also run with a previously found HEXQ
graph, thereby excluding the hierarchy construction load from the performance.
Simple one-step value backups are used to learn the value function for HEXQ,
MAXQ and the flat reinforcement learner. When HEXQ includes construction it
uses action-value iteration to find the optimal value function of the sub-MDPs after
it has learnt the model by statistically sampling transitions as described in Chapter
5.
Figure 6.7: Performance of HEXQ with and without hierarchy construction against MAXQ and a flat MDP for the stochastic taxi (average reward per time step plotted against time steps; curves for HEXQ including construction, HEXQ excluding construction, MAXQ and the flat learner).
The results are shown in figure 6.7. The average primitive reward received per
primitive time step is plotted against the number of primitive time steps elapsed.
A trial ends whenever the taxi agent delivers the passenger. After each trial, the
taxi agent is restarted again with random variable values and the learning contin-
ues. Each 100, 000 primitive time steps constitute a run. After each run the whole
problem is reset to the state before hierarchy construction and learning takes place.
The graph shows the performance averaged over 100 runs with the average reward
per time step averaged over 100 time steps.
Comparing “HEXQ with automatic hierarchy construction” against the flat learner,
the graph shows that HEXQ converges to the optimum policy in about half the num-
ber of time steps. The flat learner is able to start improving its policy right from
the start while HEXQ must first order the variables and find level 1 exits before it
can start to improve its performance. More recently, Potts and Hengst (2004) have
shown how learning can be speeded up by constructing regions at multiple levels
concurrently. Once the navigation MER is discovered, HEXQ learns very rapidly as
it transfers knowledge and reuses the optimal navigation policies in other contexts.
It is of course possible to improve both HEXQ and the flat learner using multi-step
backups or prioritised sweeping (Moore and Atkeson, 1993). The flat learner is
expected to benefit most from these techniques, making the case for hierarchical de-
composition less convincing for smaller problems. Any flat learner will eventually be
defeated by an exponential growth in the state space as the problem size increases,
which is not the case when problems can be efficiently decomposed.
As MAXQ is a priori provided with the hierarchical decomposition, for compari-
son HEXQ is tested with a decomposition found in a previous run. Not surprisingly,
as can be seen in figure 6.7, both “HEXQ excluding the hierarchical construction”
and MAXQ outperform the flat learner and “HEXQ including construction”. MAXQ
outperforms HEXQ. This is explained by the additional background knowledge pro-
vided to MAXQ in that pickup and putdown actions do not have to be explored
for the navigation task. HEXQ does not make this assumption and early average
reward per step is worse as it tests pickup and putdown actions in all possible taxi
locations.
The early dip in performance in the graph for “HEXQ excluding the hierarchical
construction” is caused by simultaneously learning the value function at multiple
levels. Lower level sub-MDPs initially have inflated costs to reach exits before
achieving optimal policies. These inflated values are incorporated in higher level
value functions and cause sub-optimal behaviour in the process of being corrected.

Table 6.3: Storage requirements in terms of the number of table entries for the value function for the flat MDP, MAXQ, HEXQ, HEXQ with stochastic actions after eliminating no-effect actions and HEXQ with deterministic actions after eliminating no-effect actions.

Level   Flat   MAXQ   HEXQ   HEXQ (stochastic actions,       HEXQ (deterministic actions,
                              no-effect actions eliminated)   no-effect actions eliminated)
3                      16     16                              16
2                      160    160                             160
1                      600    396                             272
Total   3000   632     776    572                             448
Storage requirements for the value function are shown in table 6.3. A flat learner
uses a table of 500 states and 6 actions = 3000 values. HEXQ requires 4 MDPs
at level 1 with 25 states and 6 actions = 600 values; 4 MDPs at level 2 with 5
states and 8 actions = 160 values; 1 MDP at level 3 with 4 states and 4 actions =
16 values - a total of 776 values. MAXQ by comparison requires only 632 values
(Dietterich, 2000). Interestingly, if a HEXQ graph is generated with the no-effect
actions efficiency improvement in subsection 5.7 then only 572 E values are required
to represent the complete decomposed value function with stochastic actions and 448
E values with deterministic actions as all pickup and putdown actions have no effect
at level 1.
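The table-entry counts quoted above are easy to verify mechanically; the lines below simply reproduce the arithmetic for the flat learner and for HEXQ without the no-effect optimisation.

flat = 500 * 6                                      # 500 states x 6 primitive actions
hexq = 4 * (25 * 6) + 4 * (5 * 8) + 1 * (4 * 4)     # level 1 + level 2 + level 3 sub-MDPs
assert (flat, hexq) == (3000, 776)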
This simple taxi domain has empirically supported the theory in Chapter 4 with
a reduction in computational complexity for HEXQ compared to a flat learner. The
automatically constructed reinforcement learning hierarchy performs almost as well
as the hand coded MAXQ hierarchy when compared on an equal footing. If no-
effect actions are automatically eliminated, it is conjectured that HEXQ excluding
construction would at least equal MAXQ in performance.
6.1.3 Taxi with Four State Variables
For a better understanding of the HEXQ algorithm it is instructive to experiment
further with the taxi domain by defining an additional variable and experimenting
with the order of variable processing.
The taxi problem characteristics are left unchanged to those in the previous
section, except that the representation of the taxi location is now given by two
variables that represent the x and y coordinates in its 5 × 5 grid world illustrated
in figure 6.9 (a). For example, the taxi in the figure is shown in location (3, 1).
This means that the state is represented by the four variables: passenger location,
destination location, taxi x location and taxi y location.
Running the HEXQ algorithm with this state description, the taxi y location
variable is found to change more frequently than the taxi x location. The reason is
that the internal barriers in the grid world prevent the taxi from moving in the x
direction on some random instances. The y variable which can take on 5 different
values is therefore selected to construct level 1 in the HEXQ graph. This time any
navigation action from any taxi y location may change the taxi x location. Because
of the stochastic nature of the transitions, pickup and putdown actions do not cause
transitions between taxi y locations. Consequently all five taxi y location states are
separate MERs. Three of the MERs have 4 exits each and the other two MERs
which are the source and destination locations have 6 exits each. The sub-MDPs at
level 1 each have one state and their optimal policies are simply their exit actions.
Equation 5.5 will generate 25 states for variable s2 from both the next variable,
the taxi x location in the ordering and the 5 MERs from level 1. Because the
number of exits vary for the MERs, the actions available in each s2 state will vary.
Essentially the pickup and putdown actions will not be explored for any taxi x
location when the taxi y location is either 1,2 or 3 as they have already been ruled
out by the optimal policies at level 1.
HEXQ will now find one 25 state MER in the s2 abstract state space with 8 exits
Figure 6.8: Performance of the stochastic taxi with 4 variables compared to the three variable representation (average reward per time step plotted against time steps).
and complete the hierarchy construction as in the previous case. This is an example
where the HEXQ partition did not result in a useful state abstraction opportunity.
The performance between HEXQ with three and four variables is shown in figure
6.8. The major difference in performance is that the average reward per step is
reduced while exploring level 2 as pickup and putdown actions have already been
eliminated for certain y variable labels at level 1 and are avoided.
If instead of stochastic actions all actions are deterministic, then HEXQ will find
one MER with 14 exits for the taxi y location variable at level 1. The graph for
this MER is shown in figure 6.9 (b). This time the effect of the primitive actions
of moving North and South is independent of the taxi x location and the MER
generalizes this result accordingly. At the second level in the HEXQ hierarchy one
MER consisting of 5 s2 states and 8 exits is found. This models navigation in the
x direction using the abstract actions of locating the taxi at a particular y location
Figure 6.9: Taxi domain with four variables, (a) x and y coordinates for the taxi location, (b) the y variable MER for deterministic actions, (c) the three MERs for deterministic actions when the x variable is forced to level 1.
first and then moving East or West4. HEXQ explores all 10 navigation exits at level
2. The rest of the hierarchy is again constructed as before.
In the final experiment with the four variable representation of the taxi domain,
the taxi x location and taxi y location variables are forced to be processed by the
HEXQ algorithm in reverse order. As discussed in the HEXQ theory this does not
effect the operation of the algorithm, other than in the efficiency of the decompo-
sition. With the taxi x location used at level 1, HEXQ finds four MERs shown in
figure 6.9 (c). As can be seen in the diagram, the MERs from left to right, have 5, 4,
4 and 9 exits respectively. The only MER with more than one state is the rightmost
one representing the taxi x values 3 and 4. As predicted by the theory, the problem
is decomposed successfully, but less efficiently.
Even with the better order of variables, if the criterion for combining levels from
4. Tom Dietterich, in private communications, has pointed out that a designer could restrict the termination of the y region East and West from value 2 alone for the navigation sub-task in a MAXQ graph and still achieve an optimum solution.
section 5.7 is used, ScombinedLevel = 150 < StwoLevels = 220 and hence HEXQ
would be better off combining these variables in this situation largely because of the
large number of exits generated as seen in figure 6.9. With stochastic actions the
MERs are single states and the case for combination is even stronger. All the above
variations for this problem lead to a globally optimal solution.
The taxi domain with 4 variables has shown that HEXQ is robust in the face of alternative problem descriptions and enforced variable orderings, and that the efficiency of the decomposition will vary as anticipated.
6.1.4 Fickle Taxi
Dietterich designed the fickle taxi task to show that a MAXQ recursively optimal
policy is not hierarchically optimal.
The destination location in the original taxi domain is chosen at random at the
start of the problem and remains unchanged during the trial. In this experiment
the problem is complicated by the passenger stochastically changing his or her mind
about the destination during each trial. A fickle taxi is specified with the passenger
changing the destination location with a probability of 0.3 on pickup. The MDP
again uses the three variables: passenger location, destination location and taxi
location.
If the HEXQ graph from the original taxi domain is used with the fickle taxi, the
hierarchical constraints produce a recursively optimal policy that is not as good as
the hierarchically optimal (and globally optimal in this case) policy. The abstract
actions learnt by HEXQ for the top level sub-MDP commit the taxi to put down
the passenger at the original destination location (see figure 6.5). For hierarchical
execution the taxi will complete this abstract action even though the destination
may have changed during the trial.
HEXQ is proven to be hierarchically optimal. Is this therefore a contradiction?
The answer is no. If HEXQ is allowed to rerun with the fickle taxi specification,
Figure 6.10: Performance of the taxi with a fickle passenger compared to the original decisive passenger (average reward per time step plotted against time steps, showing the optimal policy, the fickle taxi using the HEXQ graph constructed for the non-fickle taxi, the fickle taxi using the HEXQ graph constructed for the fickle taxi, and a flat MDP for reference). HEXQ results do not include the time-steps required for construction of the HEXQ graph.
then HEXQ builds a different hierarchical structure which is again hierarchically
optimal and in this case globally optimal. Figure 6.10 shows the average reward
per primitive time step over 100 runs for the fickle taxi for (1) the flat MDP, (2)
using the original HEXQ graph of subsection 6.1.1 and (3) using the HEXQ graph
constructed with the fickle specification. In the latter case, at level 2 in the HEXQ
graph, HEXQ finds 5 MERs in contrast to one MER for the original HEXQ graph
(see figure 6.4). This is because a pickup action may change the destination location
creating an exit. The fickle HEXQ graph passes control back up to the top sub-MDP
on a pickup action which provides the opportunity for the taxi to choose to go to
the changed destination.
Dietterich (2000) specifies a different fickle taxi to the one used here in that the
destination location is changed after the passenger has moved one square away from
the passenger’s source location. This fickle taxi specification is not Markov in terms
of the original representation, as the stochastic change in destination is dependent
on a previous state, namely the source location of the passenger. HEXQ is based
on the assumption that the original problem is an MDP. It is for this reason the
above example does not use Dietterich’s fickle taxi specification. Nevertheless, using
Dietterich’s specification for the fickle taxi, HEXQ will also construct a different
hierarchy and perform with similar results and conclusions to those above. There
are then, however, small differences in the final converged optimal reward per time
step values between the flat learner and HEXQ which cannot be explained in a
Markov setting.
The original intention was to repeat, albeit with automatic decomposition, the
sub-optimal results obtained with MAXQ. The optimal solution produced by HEXQ
was at first unexpected. The surprise in this experiment was that HEXQ constructed
a different hierarchy for the fickle passenger variation of the problem, although in
retrospect this is easily explainable.
6.1.5 Taxi with Fuel
Another instructive example is the introduction of fuel to the taxi domain. The
objective here is to show how HEXQ solves the “hierarchical credit assignment”
problem (Dietterich, 2000) implicitly.
Figure 6.11: The Taxi with Fuel.
The taxi problem from section 6.1 is modified with the inclusion of a fuel tank.
The taxi consumes one unit of fuel for each primitive navigation action it takes.
If the taxi runs out of fuel before reaching the destination location and delivering
the passenger the taxi is refuelled with a penalty reward of -20. There is a refilling
station as shown in figure 6.11. A refilling action is added making the total number
of actions seven. When the refilling action is executed at the refilling station, the taxi
is refuelled to a fuel level of 12 and the usual reward of −1 is incurred. If the refilling
action is taken other than at the refilling station a reward of −10 is incurred with
no effect on the fuel level. The taxi is fuelled stochastically to a random fuel level
between 5 and 12 units of fuel inclusive prior to starting each trial. The introduction
of fuel requires a fuel level variable with the 13 values 0, 1, 2, . . . , 12.
Dietterich introduced this variation to the taxi problem to illustrate the hierar-
chical credit assignment problem in MAXQ. In MAXQ a Refuel subtask is added
to the hierarchy by the designer with the objective of moving the taxi to the refill-
ing station and refuelling. Unfortunately, the navigation subtasks in MAXQ which
move the taxi to the source and destination locations do not know how much fuel is
in the tank. The navigation sub-problem is non-Markov in that it is not possible to
predict from the taxi location alone when the taxi will run out of fuel and incur a
reward of −20. The solution in MAXQ is to manually separate the rewards received
for navigation from those received for exhausting fuel and assigning the latter to the
root task in the MAXQ hierarchy so that it can learn whether to invoke the Refuel
subtask. The problem of decomposing the reward and assigning the components
to different sub-tasks is referred to as the hierarchical credit assignment problem
(Dietterich, 2000).
The HEXQ solution is very different and avoids the hierarchical credit assignment
problem. Because HEXQ creates sub-MDPs the issue of a subtask not being Markov
never arises. HEXQ finds that the fuel level variable changes most frequently and
therefore chooses this for level 1 in the HEXQ graph. As the taxi location variable
may change with every fuel level change, each of the 13 fuel levels becomes a separate
MER. The 7 primitive actions are exit actions for each of these MERs, other than
for fuel level 12 which has 6 exits as the refilling action has no effect under any
circumstances. There is very little benefit having each fuel level state as a separate
MER. It would be more efficient to combine the fuel variable with the taxi location,
the next most frequently changing variable (see subsection 5.7). HEXQ at level 2
effectively takes the Cartesian product of the fuel levels and the taxi location to
create 13 × 25 = 325 s2 states. At level 2 HEXQ discovers 1 MER with 104 exits.
The exits are generated when the passenger is picked up or delivered at the 4 source
or destination locations with any of 13 levels of fuel (2× 4× 13 = 104). Level 3 in
the HEXQ graph has 5 s3 states representing the location of the passenger. They
form one MER with 52 exits to drop off the passenger at the various destinations
and with various fuel levels remaining in the tank. The top level represents the
destination location variable as before.
Figure 6.12 shows the taxi location × fuel level MER with some example transi-
tions identified. The set of sub-MDPs for this MER learns how to navigate around
the taxi world and simultaneously control the amount of fuel in the tank. The
top level sub-MDPs can then learn the most cost effective combination of the two
abstract actions of fuel efficiently picking up and dropping off the passenger.
Figure 6.13 compares the performance of HEXQ for the taxi with fuel and a flat
reinforcement learner. In this experiment the state variables are taxi location × fuel level and passenger location × destination location. The navigation actions are
stochastic as in the original taxi example. The learning rate is set to 0.1 which, as
appendix A shows, allows the flat reinforcement learner to more closely approach
the optimal solution. A crude ε-greedy exploration policy is used with both HEXQ
and the flat learner. In HEXQ exploration is only required at the top level as the
other sub-MDPs are solved using value iteration after collecting transition statistics.
The exploration rate ε is reduced from 0.3 to 0.1 and finally to 0.0. The exploration
Figure 6.12: The MER in the taxi with fuel problem showing the taxi location × fuel level state space and some example transitions (the refuel action, navigation actions, and pickup and putdown exit actions).
rates are dropped at slightly different times for each of the learners simply to help
distinguish the two graphs. To sort the variables in HEXQ the number of random
primitive time steps is set to 5000. At level 1 the number of times each state-action
combination is explored is set to 50 which ensures that all exits are discovered most
of the time. The reward per time step is averaged over each 1000 time steps for 100
runs.
After finding the fuel-navigation MER, HEXQ is able to improve performance
more quickly than the flat learner as it transfers the learnt policies for this sub-task
to other combinations of passenger and destination locations. There are 104 exits at
level 1 generating 52 sub-MDPs one for each hierarchical exit state. HEXQ therefore
requires an E table with 120,380 entries in contrast to the flat learner which requires only 45,500 Q values.
For this example it is not critical that HEXQ discovers all the exits. If exits
Figure 6.13: Performance of the taxi with fuel averaged over 100 runs using stochastic actions and a two variable state vector (average reward per time step plotted against primitive time steps; the exploration rate ε is reduced from 0.3 to 0.1 to 0.0). HEXQ attains optimal performance after it is switched to hierarchical greedy execution.
are missed for some fuel levels then the taxi agent will “work around” the problem
by finding a suitable alternative exit at a small cost in performance. This usually
means the taxi will burn fuel on the spot until the next exit is reached. Fewer exits also mean that less storage is required for the E tables. There is a tradeoff between performance and resource usage.
When the navigation actions are stochastic, HEXQ, with hierarchical execution,
achieves a hierarchically optimal policy but cannot reach the optimal performance
of the flat learner. Stochastic navigation actions mean that it is more difficult to
reach a particular exit state which is defined in part by the amount of fuel in the
tank. It is easy for the taxi to randomly slip away from the target exit state and
then not be able to reach it as the fuel level can only decrease in most situations.
To improve this situation HEXQ is switched to hierarchical greedy execution mode,
as discussed in section 4.3, after learning. In this mode, instead of persisting with
one particular exit, the position is reevaluated after every primitive step and the
optimal action chosen using the decomposed value function. As shown in figure
6.13 the performance with hierarchical greedy execution is at least as good as that
attained by the flat learner.
From a designer’s perspective it would be desirable to abstract away the fuel level
from the navigation task. Without a model and without the benefit of background
knowledge, the taxi agent needs to learn the relationship between fuel usage and
navigation. This has been achieved successfully by HEXQ. Ryan (2002) tackles the
same issue, that of uncovering the cause of the side effect of running out of fuel when
navigating.
The main purpose of this example was to demonstrate that HEXQ avoids the
hierarchical credit assignment problem. Because HEXQ partitions the state space
into smaller MDPs each sub-MDP will, by definition, only model a Markov reward
function.
6.2 Twenty Five Rooms
The twenty five rooms example illustrates the operation of the Markov transition
condition of a HEXQ partition, that is, the discovery of exits when higher level
variables do not change value. In the first example deterministic actions are used
and MERs can be easily visualized. The second example shows the detection of an
internal room barrier modelled as either a hard border or as a high cost threshold.
The twenty five room problem consists of 25 interconnected rooms as shown in
figure 6.14. It is essentially a larger version of the rooms example used throughout
this thesis to illustrate basic concepts. The two variables describing the state are
position-in-room and room-number. The encoding assumes that each room is a
5 × 5 square of cells and the 25 rooms are also arranged in a 5 × 5 pattern. In
the first version, as can be seen in the figure 6.14, the actual rooms vary in size
with some of the walls missing or moved. Two of the rooms are divided by diagonal
walls. Despite these distortions in rooms, the labelling of the states is unchanged.
The numbers in this figure are the position-in-room value. The room values are
not shown. Rooms are labelled 0-24 left to right, top to bottom. An agent is
located at random anywhere in this environment and is required to move to a fixed
goal location somewhere else in the environment, in this case room-number 18 and
position-in-room 24. The deterministic navigation actions are North, South, East
and West and incur a reward of −1 on each step.
The position-in-room variable changes more frequently than the room-number
variable. Consequently the position-in-room is used at level 1 in the construction
of the HEXQ graph. HEXQ discovers the four MERs shown on the right hand side
in figure 6.14. It is easy to visualise and verify from the diagram that transitions
internal to each MER are identical for each of their aliased instances in the envi-
ronment. There are some exit transitions for which the room-number variable does
not change. An example is the transition from position-in-room 20 to 21. For this
transition the room-number does not change, but the transition is not Markov as
there are two instances near the bottom right of the environment where the tran-
sition is blocked by a wall. HEXQ discovers these exits by testing for a stationary
probability distribution as described under Same Context Condition in section 5.2.
The top level sub-MDP will have 100 states generated from the 4 MERs at level 1
and the 25 room-number variable values.
In the second version, shown in figure 6.15, there are two changes. The actions
are now stochastic. With 80% probability the agent will move as intended and 20% of
the time it will not move at all. Secondly, the rooms are all uniformly interconnected,
except that in one room (the one containing the goal location) a one way barrier is
inserted horizontally as shown. The barrier is specified in two separate ways: (1) a
one way wall is constructed that prevents the agent from moving South through the
barrier, and (2) the cost for making a transition across this line is increased from 1
to 100.
Figure 6.14: HEXQ partitioning of an MDP with 25 Rooms. The numbers indicate the values for the position-in-room variable.
The numbers for each location in figure 6.15 indicate the final global optimum
value of each location as found by function value (table 5.4) from the decomposed
value function. The one-way border and the virtual barrier with high negative re-
wards caused HEXQ to find 5 exits using the binomial test and K-S test respectively
(see sections A.3 and 5.2.1) after executing each transition 5000 times. The success
of both of these tests is very sensitive to the nature of the stochastic actions. So for
example, when the agent slips to the left or right of the intended action, each with
a 10% probability, as for the stochastic taxi, not all the exits are discovered even
after exploring each transition 100,000 times.
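As a rough illustration of the kind of stationarity check involved, the sketch below compares reward samples for a single transition gathered in two temporally separated windows using a two-sample Kolmogorov-Smirnov test. It assumes scipy is available; the window sizes, significance level and helper name are illustrative and are not the exact procedures of sections A.3 and 5.2.1.

```python
# A hedged sketch of a non-stationarity check for one (state, action) transition:
# if the reward distribution differs between two sample windows, the transition
# is flagged as a candidate exit.
from scipy.stats import ks_2samp

def looks_like_exit(early_rewards, late_rewards, significance=0.01):
    """Flag a transition as a candidate exit if its reward samples appear to
    come from different distributions in the two windows (non-stationarity)."""
    if len(early_rewards) < 30 or len(late_rewards) < 30:
        return False                          # not enough evidence either way
    statistic, p_value = ks_2samp(early_rewards, late_rewards)
    return p_value < significance

# Example: a transition whose reward jumps from -1 to -100 in some contexts
# (the virtual barrier) should be flagged, while an unchanged one should not.
print(looks_like_exit([-1.0] * 50, [-100.0] * 50))   # True
print(looks_like_exit([-1.0] * 50, [-1.0] * 50))     # False
```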
There is a combination of factors that prevent the tests from finding exits reliably.
Even when transitions have a low probability, the states are connected by an edge
for the purposes of finding strongly connected components. If these low probability
edges are exits, considerable random exploration is required to build the necessary
sample size to conduct the tests to the level of significance specified. This can be
overcome by increasing the amount of exploration. A major impediment, however,
is that the characteristics of the problem may militate against collecting the right
statistics. Recall that these non-stationarity tests are based on samples that are
temporally close. The assumption is that the agent is likely to experience transitions
from one type of probability distribution in a concentrated set of experiences in time.
If the environment characteristics force the agent to other parts of the state space,
it may not be able to gather enough samples of a certain type of transition for the
tests to find exits. Relaxing the confidence levels of the tests would find too many
exits and the problem would not decompose efficiently.

Figure 6.15: The optimal value for each state after HEXQ has discovered the one way barrier constructed in the room containing the goal. The barrier was constructed in two separate ways: (1) by using a border that only allows transitions North, and (2) a virtual barrier by imposing a reward of -100 for transitioning South. Both barriers produced similar value functions.
A better approach may be to conduct the tests in the context of the higher
level variables for which the samples are known to come from only one stationary
distribution because the overall problem is assumed to be Markov. The number of
these contexts can be large, but it may be possible to select specific contexts that
improve the chances of finding exits. For example, if the agent becomes trapped,
it is a clear indication that an exit has been missed. The context that should be
tested is then the higher level variables defining the trapped situation. Section 9.1
expands on these proposals.
The Twenty Five Rooms problem has illustrated the discovery of exits when
higher level variables do not change value by testing for non-stationary state transi-
tion and reward functions for both deterministic and stochastic actions. The heuris-
tics for gathering samples were found to be inadequate to test for exits in some
stochastic settings. In many problems higher level variables will always change on
lower level region exits and these tests are not required.
6.3 Towers of Hanoi
The Tower of Hanoi problem is a puzzle that involves moving discs between pegs.
So far the Taxi, All Goals Maze and Twenty Five Rooms problems have involved
navigation. The Towers of Hanoi example should dispel any notion that HEXQ
is restricted to problems which are navigation dependent or require a metric state
space as an assumed inductive bias. This problem tests HEXQ with a larger number
of variables and therefore hierarchical levels. It will also demonstrate the potential
linear scaling in space complexity with the number of variables of a decomposed
MDP.
The puzzle is usually introduced with any number of different sized discs and
three pegs. The discs are placed in order of size on the first peg with the smallest
on top. The objective of the puzzle is to relocate the initial disc stack to the third
peg with the restrictions that the discs can only be moved from peg to peg one at
a time and at no time can a larger disc be on top of a smaller one.
To make the puzzle challenging this example will use 7 discs and three pegs. With
7 discs this problem has 3^7 = 2187 possible states. To see this, imagine allocating
the discs to the pegs. Each disc can be placed on any of the three pegs. There are 3^7
ways to distribute the 7 discs. Once the discs have been distributed to the pegs the
constraint that they are ordered by size on each peg uniquely determines a state.
Figure 6.16 (a) shows the initial configuration with all the discs on peg 1. Part (b)
of the figure shows the discs in one of many intermediate states and part (c) is the
goal state with all the discs on peg 3. The minimum number of moves required to
solve the 7 disc puzzle is 2^7 − 1 = 127.
Figure 6.16: The Tower of Hanoi puzzle with seven discs showing (a) the start state, (b) an intermediate state and (c) the goal state.
The Towers of Hanoi puzzle can be represented as a 7-dimensional MDP where
each variable in the state vector refers to one of the discs and the variable value
indicates the peg number on which the disc sits. For example, the three legal states
of the puzzle from left to right in figure 6.16 may be encoded (0, 0, 0, 0, 0, 0, 0),
(2, 2, 1, 1, 1, 0, 2) and (2, 2, 2, 2, 2, 2, 2) respectively, with the variables ordered in as-
cending disc size. The variables could of course be listed in any arbitrary order in
the state vector, as HEXQ will first reorder them using the frequency of change
order heuristic of the algorithm. The actions of the MDP are defined in terms of
the from and to peg positions that a disc move is attempted. The six deterministic
actions are all of the form “move (top) disc on peg x to peg y”, where x ≠ y and
x, y ∈ {0, 1, 2} and written movexy. For example, a next valid move from state (b)
in figure 6.16 may be to move the disc on peg 2 to peg 0 or move20. If an action
attempts to relocate a disc to a peg which contains a smaller size disc or the source
peg is empty, the move is illegal and fails with all discs staying in situ. All moves
or attempted moves incur a reward of -1.
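The deterministic 7-disc puzzle just described is small enough to sketch directly. The fragment below is an illustrative Python encoding (the function and variable names are not part of HEXQ) of the state vector and the six move_xy actions, including the rule that illegal moves leave the state unchanged at a cost of -1.

```python
# Illustrative model of the 7-disc Tower of Hanoi MDP: the state is a tuple of
# peg indices ordered by ascending disc size, and an action (x, y) attempts to
# move the top (smallest) disc on peg x to peg y.
from itertools import product

ACTIONS = [(x, y) for x, y in product(range(3), repeat=2) if x != y]  # 6 actions

def step(state, action):
    """Return (next_state, reward) for a deterministic disc-moving action."""
    x, y = action
    on_x = [d for d, peg in enumerate(state) if peg == x]   # discs on source peg
    on_y = [d for d, peg in enumerate(state) if peg == y]   # discs on target peg
    if not on_x:                                             # source peg empty
        return state, -1
    top = min(on_x)                                          # smallest disc on x
    if on_y and min(on_y) < top:                             # smaller disc on y
        return state, -1                                     # illegal, no change
    next_state = list(state)
    next_state[top] = y
    return tuple(next_state), -1

start = (0,) * 7                    # all discs on peg 0
print(step(start, (0, 2)))          # ((2, 0, 0, 0, 0, 0, 0), -1)
```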
In HEXQ the number of initial random actions to sort the variables by order of
frequency is set to 200,000. The number of random actions to explore exits at each
level in the hierarchy is set to 30. HEXQ sorts and then processes the variables
in order of disc size. This is not unexpected as it is always possible to move the
smallest disc and the probability of being able to move a disc decreases with the size
of the disc. So the chance of moving the largest disc by applying random actions is
very small as it usually has other discs on top of it.
Figure 6.17: The directed graph for level 1 in the decomposition of the Tower of Hanoi puzzle showing one MER and six exits.
At each level of the hierarchy below the top level HEXQ will find one MER
and six exits. The directed graph for the first level is shown in figure 6.17. The
nodes represent the state values of the smallest disc, disc 1, and each of the edges is
labelled with the possible actions. For example, if the smallest disc is on peg 0 then
action move02 relocates it to peg 2. Action move12 may change the location of a
larger disc depending on the state of the puzzle. Therefore move12 executed when
the smallest disc is on peg 0 becomes an exit action as it violates the same context
condition of a HEXQ partition. The exits are:
(s1 = 0, a = move12)
(s1 = 0, a = move21)
(s1 = 1, a = move02)
(s1 = 1, a = move20)
(s1 = 2, a = move01)
(s1 = 2, a = move10)
There are six exits with three hierarchical exit states. This means that 3 sub-
MDPs will be generated for level 1 and six abstract actions available at level 2.
The directed graph for level 2 is shown in figure 6.18 with the six abstract
actions labelling the edges. The interpretation is a little more complex this time.
For example, with the second smallest disc, disc 2, on the leftmost peg 0, abstract action
(2, move01) will move the smallest disc to peg 2 and then move the disc on peg 0 to
peg 1. This will always have the effect of moving the second smallest disc from peg
0 to peg 1 because the only disc that could prevent this move, the smallest disc, has
been moved safely out of the way to peg 2. Again with disc 2 on peg 0, abstract
action (2,move10) will move the smallest disc to peg 2 and then attempt to move
the disc on peg 1 to peg 0. This will not succeed as any disc on peg 1 will be larger
than the second smallest disc occupying peg 0. Hence the node transitions to itself
for this abstract action. The abstract action (0, move12) will move the smallest disc
to peg 0. This means that the smallest and second smallest discs are both on peg
0. A disc move from peg 1 to peg 2 may relocate any other disc depending on the
state of the puzzle and hence (s2 = 0, (s1 = 0, a = move12)) is an exit at level 2 in
the HEXQ graph. There are six exits at level 2 shown in figure 6.18 that become
abstract actions at the third level.

Figure 6.18: The directed graph for level 2 in the decomposition of the Tower of Hanoi puzzle showing one MER and six exits.
The pattern repeats itself at each hierarchical level5. For example, to move the
i th largest disc from peg 0 to peg 2, all the smaller sized discs need to be located
out of the way on peg 1. Abstract action (1, (1, . . . (1,move02) . . .)) nested to a
depth of i − 1 will achieve this result. The total number of Q values required by
a flat reinforcement learner for the 7 disc puzzle is 2187 states times 6 actions or
13122 values. HEXQ only requires a total of 342 E values to represent the same value
function in decomposed form. In general, given a Towers of Hanoi puzzle with d
discs the number of states is 3^d and the number of actions 6. A flat reinforcement
learner will require 3^d × 6 Q values. The Q table grows exponentially with the
5An interesting idea would be to introspect after building the hierarchy to discover recursive relationships, as for example are evident between figures 6.17 and 6.18.
number of discs. A HEXQ decomposition in comparison has 3 (abstract) states and
6 (abstract) actions at each level. Three sub-MDPs are required at all levels except
the top one. HEXQ therefore only requires a total of 3 × 6 × 3 × (d − 1) + 3 × 6
E values to represent the value function. For HEXQ the storage requirements grow
linearly with the number of discs. Theorem 4.8 also guarantees that HEXQ will find
an optimal solution for any Towers of Hanoi puzzle.
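The storage comparison can be checked with a few lines of arithmetic; the helper names below are purely illustrative.

```python
# Worked comparison of flat versus decomposed table sizes for d-disc puzzles.
def flat_q_entries(d):
    return 3 ** d * 6                      # 3^d states, 6 primitive actions

def hexq_e_entries(d):
    # 3 abstract states x 6 abstract actions x 3 sub-MDPs at each of the d-1
    # lower levels, plus 3 x 6 entries for the single top-level sub-MDP.
    return 3 * 6 * 3 * (d - 1) + 3 * 6

for d in (3, 7, 10):
    print(d, flat_q_entries(d), hexq_e_entries(d))
# d = 7 gives 13122 flat Q values versus 342 HEXQ E values, as in the text.
```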
Figure 6.19: The performance comparison of a flat reinforcement learner and HEXQ on the Tower of Hanoi puzzle with 7 discs.
The performance of a flat reinforcement learner (one step backup) is compared
to HEXQ for the 7 disc Towers of Hanoi puzzle in figure 6.19. The graph shows
the number of disc moves required to solve the puzzle for successive learning trials.
The results are averaged over 100 runs. The learning rate for both learners is set
to 1. The flat learner requires an exploration strategy to allow it to converge to the
optimum value. An ε-greedy exploration was used with ε set to 30% until trial 400.
Even this did not always guarantee that the flat learner would find the optimum
solution. The HEXQ results include the automatic ordering of the variables and
construction of the hierarchy. HEXQ solves the problem in about 30 trials compared
to 400 for the flat learner. In terms of total primitive actions or disc moves HEXQ
completes the problem in about 14% of the number required by the flat learner. For
the current implementation of HEXQ the number of exploratory random actions
required to sort the variables initially will increase exponentially with the number
of discs as sample movements of the larger discs will become less probable. It is of
course possible to delay the ordering of some of the variables until the lower levels
of the HEXQ hierarchy have been constructed and collect more ordering statistics
for higher level variables at the same time that exits are explored at each level.
Two constraints prevent solving the Tower of Hanoi puzzle with time complexity
that is linear in the number of discs: the number of disc moves needed to effect an
abstract action increases exponentially with level, and determining the value of a
state from the decomposed value function requires a depth first search. The former
is a feature of the mechanics of the puzzle and cannot be improved.
The latter issue was raised in subsection 5.4. With seven levels in the HEXQ
hierarchy the depth first search to find the value of states and the best action to
take becomes noticeably more expensive. One solution is to limit the search to a
specified depth. Intuitively this heuristic makes sense in that one would expect to
capture most of the value of a state at the higher levels in the decomposition where
the greater abstraction exists. This issue will be taken up again in Chapter 8.
The Tower of Hanoi puzzle can be made more challenging with the introduction
of stochastic actions. Disc moving actions perform as intended 80% of the time,
10% of the time the disc is mistakenly replaced on the same peg from which it was
removed and 10% of the time it is mistakenly placed on the other unintended peg. With the
MDP representation defined as above HEXQ will decompose and solve the problem
with similar results and in a similar way to the problem with deterministic actions.
Figure 6.20: The directed graph for level 1 in the decomposition of the stochastic Tower of Hanoi puzzle showing one MER and 12 exits.
One MER is discovered per level, except this time each MER will have 12 exits which
generate 3 sub-MDPs per level. The directed graph with all the edges labelled with
their actions and their probability for level 1 of the HEXQ hierarchy is shown in
figure 6.20. Exit actions are shown as arrows leaving exit states.
This domain demonstrated HEXQ successfully constructing a seven level hier-
archy with three sub-MDPs per level. The space complexity grows linearly with
the number of variables in contrast to a flat learner where the state space grows
exponentially. With seven levels the best first search to find the value of a state and
the next best greedy action becomes prohibitive. It is possible to limit the depth
of the search for this puzzle and still obtain an optimal solution. A future research
direction is to characterise problems that allow good approximations when limiting
the depth of search.
6.4 Ball Kicking - a larger MDP
The final example illustrating the operation of the HEXQ algorithm is an example
inspired by the University of New South Wales’ success over the last five years in the
international RoboCup soccer competition (Hengst et al., 2001). While this stylized
problem of a simulated humanoid learning to kick a soccer ball into the goal is a long
way from the ultimate RoboCup objective, it is designed to illustrate the ability of
HEXQ to automatically decompose and solve a larger MDP.

Figure 6.21: A simulated stylised bipedal agent showing its various stances.
The agent in this case is a simulated stick figure humanoid robot as illustrated
in figure 6.21. Each foot of the agent is assumed to be able to move independently
to any one of 4 positions shown on the bottom left in the figure. This makes a total
of 16 possible states representing the stance of the two legs shown on the right of
the figure. Sixteen primitive actions are defined for the robot to assume any stance.
The soccer field is discretised into hexagonal cells. In order for the humanoid to
move over a cell and to another cell, it must assume precisely the series of stances
at four positions along each cell as indicated in the left of figure 6.22. The possible
cell positions and robot directions are indicated on the right. The series of stances
is meant to represent a cycle in a walking gait. As the humanoid is free to move its
legs to any stance at each of the four positions across the cell the state space has
now grown to 16 × 4 = 64 possible states.

Figure 6.22: The four stances (left) that comprise a successful traversal of a hexagonal cell (right). Each of the six directions has 4 associated positions across the cell. One set is illustrated.
Six primitive turning actions allow the humanoid to change the direction in which
it is facing. A turning action is only possible when changing from leg stance (ii) to
leg stance (iii) in figure 6.22. It can attempt to change direction in any cell position.
Changing direction is only effective when executed by the robot in an inner position
on the hexagonal cell and when facing towards the middle of the cell. An effective
turn will land the humanoid in position 3 facing in the direction dictated by the
action. Effective turning also needs to be discovered by the agent. With the six
directions added to the state description of the stance, the size of the state space
is 64× 6 = 384. The number of primitive actions has increased to 16 + 6 − 1 = 21
(one stance change action is performed simultaneously with the turning actions).
The humanoid agent does not have a model of the effect of the 21 actions when
in any of the 384 stances6. These must be discovered. Only the foot positions, the
direction the robot is facing and the cell positions are represented. The position
and dynamics of the rest of the body of the robot, including the arms, can be animated
in synchronisation with the feet for visual effect, but do not play a role in the
humanoid's behaviour.

6The walking model of this simulation is based on a similar idea in Uther (2002).

Figure 6.23: The stylized soccer field illustrating the stochastic nature of the soccer ball.
The soccer field is modelled by 20 × 30 = 600 hexagonal cells with each goalmouth
4 hexagonal cells wide as shown in figure 6.23. The ball is assumed to be located on
a hexagonal cell. It is “kicked” when the robot leaves the cell on which the ball is
located. The ball is kicked in one of the six directions that the robot is facing and
lands randomly between three and five cells forward and one cell to either side as
shown in figure 6.23 at the next time step. If the soccer ball is kicked outside the
field it is replaced at the edge of the field where it crossed the boundary.
A 3-dimensional MDP is defined with the variables: robot stance (384 values),
ball location (600 values) and robot location relative to the ball (2501 values).
The robot location relative to the ball is represented as a cell position relative
to the ball as shown in figure 6.24. This is a deictic variable (Ballard et al., 1996)
and defines a relative coordinate frame avoiding the unwanted variance that would
otherwise be introduced by describing the robot's position on the field. Its values are
allowed to range from −30 to +30 horizontally and from −20 to +20 vertically.
While the ball is restricted to stay within the soccer field boundary the robot can
run outside the boundary to approach the ball. Both the robot and the ball are
initially (and after each trial) placed independently and randomly anywhere on the
soccer field. The reward is −1 for each primitive action. The trial terminates when
the robot has kicked the ball into the blue goal (right hand side).

Figure 6.24: The deictic representation of the location of the ball relative to the agent.
The total state space size is therefore 384 × 600 × 2501 = 576,230,400. Together
with the 21 primitive actions, a flat reinforcement learner would require a Q table
size of 576,230,400 × 21, or over 12 billion table entries.
HEXQ orders the variables using an initial 10,000 random actions. Not surpris-
ingly the robot stance variable changes most frequently, followed by its ball relative
position. The ball location on the field is the least changing variable as the ball only
changes position on the field once kicked.

Figure 6.25: The HEXQ graph for the ball kicking domain. Level 1 regions learn to “walk”, level 2 regions learn to kick the ball and the top level learns to kick goals.
The HEXQ graph generated to solve this problem is illustrated in figure 6.25. At
level 1 in the HEXQ hierarchy one MER is found with 6 exits. The exits occur when
the robot changes hexagonal cell or the ball moves. There are 6 exits, one for each of
the possible directions that the robot is facing when on the last position on the cell.
The MER at level 1 is a state space in which the robot can change its leg positions
and direction. It is able to reach any adjacent cell with the correct sequence of
primitive moves. The six sub-MDPs generated learn the optimal policy to move
the legs and change direction to allow the robot to choose to move to any of the 6
adjacent cells. An abstract action at level 2 is therefore seen to be a movement of
the legs through a cycle, moving the robot along the ground and possibly changing
direction in the process. It has the appearance and effect of an animated cartoon.
At level 2, HEXQ also finds one MER with six exits. This time the exits indicate
that the ball location has changed. There are six exits because each of the six
abstract actions can “kick” the ball in different directions. The policies at level 2
translate to abstract actions at level three that have the effect of approaching the
ball and kicking it in any one of six possible directions. The top level sub-MDP
finally learns to kick the ball into the blue goal.

Table 6.4: The number of E action-value table entries required to represent the decomposed value function for the soccer player compared to the theoretical number of Q values required for a flat learner.

Level        States        Actions   Exits   sub-MDPs   E/Q values
FLAT         576,230,400   21        -       -          12,100,838,400
1            384           21        6       6          48,384
2            2,501         6         6       6          90,036
3            600           6         -       -          3,600
Total HEXQ                                              142,020
The number of table entries to store the decomposed action-value function E is
calculated in table 6.4. The saving in storage requirements is close to 5 orders of
magnitude. With the exploration count set to 1 for each transition on the first two levels
of the hierarchy, HEXQ will decompose and solve this problem in less than a minute
on a 400MHz Pentium desktop computer. The straightforward flat formulation of
this problem is intractable and no comparative results can be given. By corollary
4.8 from section 4.3.1 the HEXQ solution is globally optimal.
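The entries of table 6.4 can be reproduced with the following illustrative arithmetic, using the level sizes quoted in the table.

```python
# Reproducing the table 6.4 storage counts: E values per level for HEXQ
# versus the theoretical Q table of the flat learner.
levels = [
    # (states, actions, sub-MDPs) per level of the HEXQ hierarchy
    (384,  21, 6),   # level 1: stance/direction states, primitive actions
    (2501,  6, 6),   # level 2: ball-relative positions, abstract actions
    (600,   6, 1),   # level 3: ball locations, abstract actions, one sub-MDP
]
hexq_total = sum(s * a * m for s, a, m in levels)
flat_total = 384 * 600 * 2501 * 21
print(hexq_total)                # 142,020 E values
print(flat_total)                # 12,100,838,400 Q values
print(flat_total / hexq_total)   # roughly 5 orders of magnitude saving
```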
6.5 Conclusions
This chapter has empirically evaluated the HEXQ algorithm on a variety of deter-
ministic and stochastic shortest path problems to test its automatic decomposition
and solution properties. The experiments have shown that HEXQ:
• may find Markov equivalent sub-spaces in the global problem. The amount
of exploration required to discover all exits is currently tuned manually. A
solution to this issue is further discussed in Chapter 9.
• may decompose some stochastic shortest path problems into a hierarchical
structure of smaller sub-MDPs and solve the overall problem.
• constructs multiple levels of hierarchy.
• decreases the computational complexity in both storage and learning time
from exponential to linear in the number of variables in some cases. Once
solved, the execution time for HEXQ increases exponentially with the number
of variables. Solutions to this problem are discussed in Chapter 8.
• is robust against a forced change in variable ordering and a different number of
variables to describe the same problem.
• will reconstruct the hierarchy to ensure the solution remains hierarchically
optimal if the problem characteristics are changed.
• will find a globally optimal solution for deterministic transitions or when tran-
sitions are stochastic at the top level only.
• avoids the hierarchical credit assignment problem as rewards on exit are mod-
elled at higher levels where they can be explained. All regions are Markov,
ensuring no reward surprises.
• will decompose problems when different state transition and reward function
behaviour is found for different higher level variable values. Exits need to be
detected when higher level variables do not change value. For stochastic problems
the temporally close sampling heuristics currently employed are found to be
sensitive to the stochastic nature of the environment. A solution is proposed
and left for future implementation.
The results presented in this Chapter are consistent with the HEXQ decompo-
sition theory and the requirements of the HEXQ algorithm. The automatic decom-
position and scaling potential of HEXQ have been successfully demonstrated.
Chapter 7
Decomposing Infinite Horizon
MDPs
The thesis to this point has considered the automatic decomposition of stochastic
shortest path problems. Stochastic shortest path problems are episodic, meaning
that the task will eventually terminate. This chapter will extend the HEXQ auto-
matic decomposition approach to infinite horizon Markov decision problems. These
are MDPs for which an optimal policy may require the agent to continue acting
indefinitely.
Infinite horizon control policies are quite common. Examples include controlling
a car to stay in a lane and pole balancing. It is easy to construct an infinite horizon
version of the taxi task introduced in Chapter 6. An infinite horizon specification
could simply generate a new passenger and destination after each passenger drop-off
and require the taxi to continue to deliver new passengers forever.
Infinite horizon MDPs present three issues for HEXQ:
• retaining state abstraction with a discounted value function,
• guaranteeing the termination of sub-MDPs and
• allowing for the possibility that an optimal policy may require the agent to
continue execution in a lower level sub-MDP.
The first issue is caused by discounting. Recall that the value function of a state
is the expected sum of discounted future reward. A reward N steps in the future is
discounted by γ^{N−1}, where γ is a constant discount factor less than one. Discounting
is necessary to ensure that the value function remains bounded. Unfortunately, the
amount of discount applied to rewards after exiting a sub-MDP now depends on
how many steps it takes to reach the exit. In figure 7.1 the discount after exiting is
smaller from the perspective of state s1 than for state s2 because of the extra time
steps required to reach the exit.
Figure 7.1: For a discounted value function, the amount of discount applied after exiting the sub-MDP depends on the number of steps required to reach the exit.
One solution is to store separate exit values for each state of a sub-MDP. This
gives up state abstraction of the exit values and with it the scaling potential for
HEXQ1. State abstraction has been shown to be important for scaling in reinforce-
ment learning by other researchers (Dayan and Hinton, 1992, Kaelbling, 1993, Di-
etterich, 2000, Andre and Russell, 2002). The consequences of discounting in hier-
archical reinforcement learning with a single decomposed value function present a
1Dietterich also discusses this issue under result distribution irrelevance (Dietterich, 2000), concluding that abstractions of this kind are only possible in an undiscounted setting [in MAXQ].
major impediment to scaling.
This chapter will show how this issue can be overcome with the introduction
of an additional decomposed discount function. This discount function, working
in concert with separate sub-MDP discounted value functions, allows safe state
abstraction in the face of discounting. The discount function stores the amount
of discount required on region exit for each state in the region. This approach
is a natural extension to the automatic decompositions produced by HEXQ and
can be applied to other hierarchical reinforcement learning methods with additional
constraints.
The second issue is to ensure the termination of sub-MDPs to generate proper
policies for exiting the region. The issue arises because HEXQ expects a sub-MDP
with an exit to find a policy that will actually use this exit. However, discounting
or the possibility of positive rewards may result in an optimum policy that does not
exit the sub-MDP. One solution is to artificially inflate the termination value at the
exit of a sub-MDP to overcome any reluctance to exit. For stochastic shortest path
problems this issue does not arise as the discount factor is one, all rewards inside
the sub-MDP are negative and the termination value is zero. For these problems
exiting is assured. One solution for infinite horizon problems is to define a large
positive pseudo exit value to overcome any positive rewards inside the sub-MDP or
effects due to discounting and thereby force an exit.
The final issue concerns the possibility that an optimum policy contains a non-
terminating abstract action. For stochastic shortest path problems the task always
terminates and all sub-MDPs policies are proper, meaning that they will eventually
exit. For infinite horizon problems it is possible that the best policy is for an
agent not to exit a sub-MDP but to continue to execute it forever. A hierarchical
reinforcement learning algorithm must create and allow an agent to explore such
continuing abstract actions.
As an example of a problem with a non-terminating abstract action policy, con-
sider the ball kicking agent from section 6.4, but this time with rewards constructed
to make it more profitable for the agent to run around the ball rather than kick
goals. Running around the ball is achieved by continuing in the second level region
that models the agent’s relation to the ball and ignores the ball’s location on the
field. This presents an execution dilemma. If the agent continues to run around the
ball it is prevented from exploring policies that involve kicking the ball.
The solution presented in this Chapter is for HEXQ to create non-terminating
sub-MDPs for each Markov equivalent region. These sub-MDPs are artificially
timed-out during learning, allowing higher level solutions to be explored. By allow-
ing non-terminating abstract actions, a recursively optimal solution can be found
for HEXQ decomposable infinite horizon MDPs, even when the solution requires
continuing in a sub-task.
Infinite horizon modifications to both the taxi domain and the ball kicker task
will be used to illustrate these extensions to HEXQ.
7.1 The Decomposed Discount Function
The aim of this section is to derive decomposition equations for a discounted value
function that retain the property of state abstraction. These equations are needed to
allow HEXQ to state abstract infinite horizon MDPs. This section retraces some of
the derivation steps for the HEXQ decomposition equations for stochastic shortest
path problems in Chapter 4, but this time with a discounted value function.
From Chapter 2 the value function for state s, as the expected sum of future
discounted rewards, is
V πm(s) = E{r1 + γr2 + γ2r3 . . .} (7.1)
where π is a stationary policy, m is a HEXQ sub-MDP, r_t are the primitive rewards
received at each time step and γ is the discount factor.
In general sub-MDP m uses abstract actions to invoke lower level sub-MDPs
based on policy π. Allowing random variable N to represent the number of steps
required to exit the next lower level sub-MDP, ma, equation 7.1 can be written as
the sum of two series
V^\pi_m(s) = E\Big\{\sum_{n=1}^{N-1} \gamma^{n-1} r_n\Big\} + E\Big\{\sum_{n=N}^{\infty} \gamma^{n-1} r_n\Big\}. \qquad (7.2)
The first series is the local discounted value function for sub-MDP ma where the
termination is defined by the zero reward exit to an absorbing state. Isolating the
reward on exit in the second series gives
V^\pi_m(s) = V^\pi_{m_a}(s) + E\Big\{\gamma^{N-1} r_N + \sum_{n=N+1}^{\infty} \gamma^{n-1} r_n\Big\}. \qquad (7.3)
If s′ is the state reached after exiting the sub-MDP ma, R the expected reward
on exit to s′ after N steps, and defining P^π(s′, N | s, π(s)) as the joint probability of
reaching state s′ in N steps starting in state s and following policy π, then equation
7.3 becomes
V^\pi_m(s) = V^\pi_{m_a}(s) + \sum_{s', N} P^\pi(s', N \mid s, \pi(s))\, \gamma^{N-1} \big[R + \gamma V^\pi_m(s')\big] \qquad (7.4)
Importantly, for a HEXQ partition, the termination state s′ reached and the
expected reward R on exit are independent of the number of steps, N , to reach the
exit. This is the case because (1) there is only one exit state defining each sub-MDP,
(2) that state is reachable with probability one, and (3) terminating the sub-MDP
is constrained via this one exit state. This is the key property that makes it possible
to abstract a whole region into an aggregate state2. Independence means that:
2Multi-time models for options (Precup, 2000) and similar approaches for MAXQ (Dietterich, 2000), HAMs (Parr, 1998) and ALisp (Andre and Russell, 2002) are used to apply the correct discount on the termination of abstract actions. The important difference here is the guarantee of state abstraction in addition to the correct discounting for abstract actions.
P^\pi(s', N \mid s, a) = P^\pi(s' \mid s, a) \cdot P^\pi(N \mid s, a) \qquad (7.5)
Equation 7.4 using abbreviation a = π(s) becomes
V^\pi_m(s) = V^\pi_{m_a}(s) + \sum_{N=1}^{\infty} P^\pi(N \mid s, a)\, \gamma^{N-1} \cdot \sum_{s'} P^\pi(s' \mid g, a) \big[R + \gamma V^\pi_m(s')\big].
Definition 7.1 (Discount Function) The discount function D^π_m is the expected
discount to be applied to the exit value of m, under policy π, where N represents the
random number of steps to exit.
D^\pi_m(s) = \sum_{N=1}^{\infty} P^\pi(N \mid s, \pi(s))\, \gamma^{N-1} \qquad (7.6)
The discounted version of the exit action value function E (from Chapter 4) is
defined as follows:
Definition 7.2 (E Function) The HEXQ action-value function E (or exit value
function) for all states s in region g is the expected value of future discounted rewards
after completing the execution of the abstract action a exiting g and following the
policy π thereafter. E includes the expected primitive reward on exit, but does not
include any rewards or discounting accumulated inside g.
E^\pi_m(g, a) = \sum_{s'} P^\pi(s' \mid g, a) \big[R + \gamma V^\pi_m(s')\big] \qquad (7.7)
where P^π(s′ | g, a) is the probability of transitioning to state s′ after abstract action
a terminates from any state s ∈ g and R is the expected final primitive reward on
transition to state s′ when abstract action a terminates.
Using these definitions, equation 7.4 can now be written succinctly as:
V^\pi_m(s) = V^\pi_{m_a}(s) + D^\pi_{m_a}(s) \cdot E^\pi_m(g, a) \qquad (7.8)
The discount function D is itself recursively represented. Let N = N1 + N2, where
N1 is the random number of steps to exit sub-MDP m_a, reaching state s′, and N2 is
the balance of steps to exit sub-MDP m. N2 and s′ are independent of N1 and s.
From equation 7.6
D^\pi_m(s) = \sum_{N_1, N_2} P^\pi(N_1 \mid s, a)\, P^\pi(N_2 \mid s, a)\, \gamma^{N_1 + N_2 - 1}
          = \gamma \sum_{N_1} P^\pi(N_1 \mid s, a)\, \gamma^{N_1 - 1} \cdot \sum_{N_2} P^\pi(N_2 \mid s, a)\, \gamma^{N_2 - 1}
          = \gamma\, D^\pi_{m_a}(s)\, \Gamma^\pi_m(g, a) \qquad (7.9)
where
Definition 7.3 (Γ Function) The action-discount function Γ for all states s in
region g is the expected value of discount that is applied to state s′ reached after
completing the execution of the abstract action a exiting g and following the policy
π thereafter. Γ assumes that the value of the state following exit of sub-MDP m is
1 and all rewards leading up to exit are zero.
\Gamma^\pi_m(g, a) = \sum_{s'} P^\pi(s' \mid g, a)\, D^\pi_m(s') \qquad (7.10)
where P^π(s′ | g, a) is the probability of transitioning to state s′ after abstract action a
terminates from any state s ∈ g.
Equations 7.7, 7.8, 7.9 and 7.10 are the HEXQ decomposition equations for a
discounted value function following policy π. The optimum value functions, where
abstract action a invokes sub-MDP ma and state s is in region g, are
V^*_m(s) = \max_a \big[V^*_{m_a}(s) + D^*_{m_a}(s) \cdot E^*_m(g, a)\big] \qquad (7.11)

D^*_m(s) = \gamma \max_a \big[D^*_{m_a}(s) \cdot \Gamma^*_m(g, a)\big] \qquad (7.12)
This formulation requires only one value to be stored for functions E and Γ for
all states in sub-MDPs for region g. Safe abstraction of the sub-MDP’s states can
be retained as in the non-discounted case. This is achieved at the cost of storing
the separate on-policy action discount function Γ. The benefit is that decomposed
discounted value functions can, in the best case, scale linearly in space complexity
in the number of variables using HEXQ versus an exponential increase for the flat
problem. From equation 7.8 or 7.11 it can be seen that the top sub-MDP does not
need a discount function. The storage requirements increase to less than twice
that required for an undiscounted episodic problem.
Figure 7.2 is a simple example of how an overall value function is composed for
an MDP with discounting. Figure 7.2 (a) shows a 2-dimensional (x, y) state MDP
with two identical regions with one exit to the right. The deterministic actions are
to move left and right and all rewards are -1 per step. The reward on termination
is 20. The discount factor is 0.9. The key point is that the composed value function
in 7.2 (b) exactly represents the original MDP value function by only storing the
exit value E for the two y variable labels at the top level sub-MDP (i.e. values 7.71
and 20.0).
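The example of figure 7.2 is small enough to reproduce directly. The sketch below assumes the optimal "move right" policy and recomputes the local value function, the discount function and the composed values; the function names are illustrative and are not part of the HEXQ implementation.

```python
# Reproducing the figure 7.2 example: two identical 5-cell regions, deterministic
# left/right actions, -1 per step, +20 on final termination, gamma = 0.9.
GAMMA = 0.9

def local_value(x):
    """Discounted sum of -1 rewards until exiting the region (exit reward excluded)."""
    steps_to_exit = 4 - x                       # the exit lies to the right of x = 4
    return sum(-1 * GAMMA ** n for n in range(steps_to_exit))

def local_discount(x):
    """Expected discount applied to the exit value: gamma^(N-1) with N steps to exit."""
    return GAMMA ** (4 - x)

# Exit values at the top level: leaving y = 1 terminates with reward 20;
# leaving y = 0 enters y = 1 at x = 0 with reward -1.
E_y1 = 20.0
E_y0 = -1.0 + GAMMA * (local_value(0) + local_discount(0) * E_y1)   # = 7.71

for y, E in ((0, E_y0), (1, E_y1)):
    print([round(local_value(x) + local_discount(x) * E, 2) for x in range(5)])
# y = 0: [1.62, 2.91, 4.35, 5.94, 7.71]
# y = 1: [9.68, 11.87, 14.3, 17.0, 20.0]  (the figure rounds 11.87 to 11.9)
```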
Figure 7.2: A simple example showing the state abstracted decomposition of a discounted value function. (a) shows a 2-dimensional MDP with two identical regions with one exit to the right. The deterministic actions are move left and right, all rewards are -1. The reward on termination is 20. The discount factor is 0.9. (b) the composed value function for each state. (c) and (d) are the abstracted sub-MDP value and discount functions respectively.

A similar technique can be used with MAXQ. Termination of MAXQ sub-tasks
allows multiple exit states and for a similar treatment to work it is additionally
required that either all MAXQ sub-tasks have single exit states or that all states s′
reached after termination have the same exit value for any instance of the sub-task.

As each level of the HEXQ hierarchy is automatically constructed, the stationary
policy found is the recursively optimal policy for each sub-MDP. In section 4.3 it
was proven that the recursively optimal solution of a HEXQ decomposition is also
hierarchically optimal. This is no longer the case for a discounted value function
because the optimum value function and policy can be affected by the exit value of
a sub-MDP. Hierarchical execution therefore means a recursively optimal solution
for HEXQ.
In practice HEXQ determines the action discount function Γ by using dynamic
programming policy iteration in the same way that the local value function is cal-
culated during hierarchy construction. Function Execute (table 5.5 in Chapter 5) is
used to additionally update the discount function by performing on-policy updates
synchronously with value function updates.
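A simplified sketch of such synchronous on-policy updates for a bottom-level sub-MDP (primitive actions only) is shown below. The learning rate, table layout and is_exit flag are illustrative assumptions, and this temporal difference form is only an approximation of the dynamic programming and Execute procedures referred to above.

```python
# While following the sub-MDP's policy, the local value estimate V and the
# discount estimate D are updated together from each primitive transition.
def update_level1(V, D, s, r, s_next, is_exit, alpha=0.1, gamma=0.9):
    if is_exit:
        # The exit reward is modelled one level up, so the local target value is 0,
        # and exactly gamma^0 = 1 of discount remains to apply to the exit value.
        v_target, d_target = 0.0, 1.0
    else:
        v_target = r + gamma * V[s_next]      # local discounted value target
        d_target = gamma * D[s_next]          # one more step of discount to the exit
    V[s] += alpha * (v_target - V[s])
    D[s] += alpha * (d_target - D[s])
```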
7.2 Infinite Horizon MDPs
This section explains how HEXQ is modified to solve infinite horizon problems.
It has been assumed that all sub-MDPs will terminate with probability one. This
is no longer the case because of discounting or the possibility of positive rewards.
To allow the HEXQ decomposition to work with all finite MDPs, termination of
sub-MDPs in the presence of positive and negative rewards needs to be guaranteed.
Even when all the rewards are negative, an optimal policy does not guarantee that
a sub-MDP will exit when using a discounted value function.
A pseudo exit function Ē that has a large enough positive termination value to
force any sub-MDP to exit is used. The pseudo exit function Ē is defined in the
same way as the exit function E, except that instead of a zero reward on exit, a
large enough positive reward is used to force the agent to exit the region3.
In summary there are three exit functions now for each sub-MDP in HEXQ. The
pseudo reward exit value function Ē determines the policies available as abstract
actions at the next level up in the hierarchy. The function Γ holds discount values
to be applied to the real exit value function at the next level up. The real exit
value function E holds the (in Dietterich's words) "uncontaminated" exit values
for each sub-MDP. Ē, E and Γ are determined using dynamic programming during
hierarchy construction from the discovered model as for stochastic shortest path
problems. Subsequently Ē is updated on-line using temporal difference methods
and Γ and E are simultaneously updated following Ē's policy in function Execute.
For infinite horizon problems it is of course possible that a recursively optimal
policy means that the agent may not wish to exit a sub-task at any level of the hier-
archy. Basic HEXQ does not provide for this option as every sub-MDP is designed
to exit. It would be a simple extension to allow the top sub-MDP to be continuing
(infinite horizon), but what about all others? Solutions that continue in a sub-MDP
below the top level are accommodated with the inclusion of a non-terminating policy
in each sub-task policy cache. The stay-in-region policy inclusion is also suggested
by Hauskrecht et al. (1998) for one of their methods for constructing a macro set
(policy cache) over regions.

3A similar trick is used in MAXQ. In MAXQ the pseudo reward value function sets large negative values on undesirable terminations. These are not required in HEXQ because the HEXQ algorithm ensures, by construction, that unwanted exits cannot occur.
Recall that one sub-MDP is generated for each exit state in step 7 of the HEXQ
algorithm in table 5.1. Exiting is assured by the pseudo reward forcing function
which uses the incentive of a large positive reward on exit. To allow for the possibility
of a recursively optimal policy continuing in a sub-MDP, an additional sub-MDP is
created without an absorbing state for each region in the HEXQ partition. The stay-
in-region policy is induced by disallowing all exits for the sub-MDP. This creates
an abstract action at the next level and an extra policy in the policy cache that
continues in the sub-MDP forever.
To ensure that such a policy exists it is necessary to ensure that regions are
not forced to exit. The strongly connected components found by HEXQ have just
this property as all states can always reach each other. In the event of single state
regions, a no-exit policy is only generated if that state can transition to itself without
exiting4.
During learning, a time-out is now required for sub-MDPs. A non-exiting sub-
MDP will not return control to its parent and a means of interrupting the execution
of the sub-MDP is required. The number of steps that a sub-MDP executes is
counted and if it exceeds a threshold value the execution is interrupted and control
is passed back up the hierarchy to the top level without updating exit value functions.
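A minimal sketch of this time-out mechanism, with hypothetical helper names, is given below.

```python
# Execute a (possibly non-exiting) sub-MDP policy under a step budget. If the
# budget is exhausted before an exit occurs, control is returned without
# updating the exit value functions.
def execute_sub_mdp(env, state, policy, budget=500):
    for _ in range(budget):
        action = policy(state)
        state, _reward, exited = env.step(action)
        if exited:
            return state, True       # normal exit: caller may update exit values
    return state, False              # timed out: skip exit-value updates
```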
7.3 Infinite Horizon Experiments
Modified versions of the taxi task from section 6.1 and the robot soccer player from
section 6.4 are used to provide empirical evidence that HEXQ is able to perform in
an infinite horizon setting and continue in a sub-task if required.
4Regions are not tested for potential combination as in section 5.2.2. The algorithm could be improved by allowing regions to be combined if an exit is not forced on entry to the combined region. In the current implementation each strongly connected component is conservatively taken to be a separate region.
7.3.1 Continuing Taxi
The taxi domain is modified by making the task continuing. The taxi is additionally
given the option of performing a local job that only requires the use of the navigation
sub-task. The objective is to compare the solutions found by HEXQ to the optimal
behaviour found by a flat reinforcement learner for various reward scenarios.
The taxi task is reproduced in figure 7.3. A taxi trans-navigates a grid world to
pick up and drop off a passenger at four possible locations, designated R, G, Y and
B. Actions move the taxi one square to the North, South, East or West, pickup or
putdown the passenger. Navigation actions have a 20% chance of slipping to either
the right or left of the intended direction. Generally all actions incur a reward of
−1. If the taxi executes a pickup action at a location without the passenger or a
putdown action at the wrong destination it receives a reward of -10. To make the
task continuing, another passenger location and destination are created at random
after the passenger is delivered and the taxi is mysteriously tele-ported to a random
grid location. The reward for delivering the passenger is 200.
The taxi domain is augmented with another source of income for the taxi. If
the taxi transitions from below to the grid location marked with the $ sign in figure
7.3, the reward is a positive number. This may be interpreted as a local delivery
run with guaranteed work but at a different rate of pay. The taxi problem can be
modelled as an MDP with a 2-dimensional state vector s = (taxi location, passenger-
source/destination). There are 25 possible taxi locations and 20 possible passenger-
source × destination combinations.
With all three value functions, pseudo, real and discount, HEXQ requires 25
states × 5 sub-MDPs × 4 primitive actions × 3 value functions = 1500 table entries
at level 1. At the top level there is one sub-MDP (semi-MDP) with 20 states × 9 abstract actions × 1 real value function (the others are redundant) = 180 table
entries, a total requirement of 1680 table entries compared to 3000 for the flat
learner. These storage requirements assume actions with no effect are eliminated
(section 5.7). The pseudo value function is not required for execution of a learnt
policy and can be discarded after learning, reducing the storage requirement for the
decomposed value function to 1180 table entries. For larger problems the savings
can become more significant.
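These storage counts can be reproduced with the following illustrative arithmetic.

```python
# Storage for the continuing taxi: the decomposed tables versus the flat Q table.
level1 = 25 * 5 * 4 * 3        # states x sub-MDPs x primitive actions x {pseudo, real, discount}
top    = 20 * 9                # abstract states x abstract actions, one real value function
print(level1 + top)            # 1680 entries during learning
print(25 * 5 * 4 * 2 + top)    # 1180 once the pseudo exit function is discarded
print(25 * 20 * 6)             # 3000 Q values for the flat learner
```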
Figure 7.3: The infinite horizon taxi task. The graph shows that HEXQ finds and switches policies similarly to that of a flat reinforcement learner for various values of positive reward at $. As the reward at $ increases the taxi stops delivering passengers and instead visits the $ location as frequently as possible. This provides confirming evidence for the correct operation of HEXQ for an infinite horizon problem, even when the optimum solution requires continuing in the non-exit navigation sub-MDP.
This problem has two optimal solutions depending on the value of the $ reward.
If the $ reward is low, the solution is to continue to pick up and deliver passengers as
per the original taxi problem. For larger $ values the taxi prefers to continue local
delivery runs and ignore the passenger. As the local $ reward is the same in the
context of any passenger pickup and destination location, HEXQ discovers one class
of region for navigation. Performing local deliveries means deciding to continue in
a navigation sub-task.
It is instructive to vary the local reward at location $ to see when the taxi decides
to perform local deliveries instead of picking up and dropping off passengers. The
local reward is varied from −1 to 19 in increments of 1 and the MDP solved both
with HEXQ and with a flat RL. The number of times the $ location is visited
per time step is counted as an indication of which strategy the taxi uses. A high
visitation rate indicates the taxi prefers the local delivery run.
A HEXQ automatic decomposition of the original episodic taxi task creates 4 sub-
MDPs at level 1 (see section 6.1.1). For the continuing taxi task we still only have
4 exit states at level 1, but an extra non-exiting sub-MDP is created as explained
previously. Figure 7.3 indicates the optimal policy chosen for each value of local
reward $. As the amount of local reward increases the taxi switches strategy from
delivering passengers to local delivery runs. The switch takes place when the local
reward is about 10 for both HEXQ and the flat learner. The error bars indicate the
maxima and minima over ten runs for each reward setting.
Within error bounds, both learners find similar solutions for this problem, pro-
viding confirming evidence that HEXQ can find the correct solution by persisting
in a sub-task if necessary.
7.3.2 Ball Kicking
The second example is the stylized bipedal robot introduced in section 6.4, that
learns to walk, kick a ball and score goals. This problem would be intractable with
a flat learner on present day desktop computers requiring nearly 3 billion Q values.
The experiment is designed to demonstrate the benefit of state abstraction in a
discounted setting.
The MDP state description contains three variables: the robot leg stance and
direction (384 values), the position of the robot relative to the ball (861 values) and
the position of the ball on a soccer field (400 values). 21 primitive actions allow
the robot to move the legs and change its direction. The robot leg positions, soccer
field and ball behaviour is shown in figure 7.4 (a) and (b). The reward on each state
transition is −1 and a trial terminates when a goal is scored. When the robot runs
into the ball, the ball is kicked stochastically but roughly in the direction the robot
is moving. HEXQ automatically generates a 3 level hierarchy. A similar problem
is described more fully in section 6.4. The primary interest is to test HEXQ when
positive reward is introduced. In separate experiments, positive reward is introduced
to
1. reward the robot for running on the spot and
2. reward the robot for running around the ball.
The significance of these conditions is that recursively optimal solutions can only be
found by the robot continuing in a level 1 sub-task and level 2 sub-task respectively.
Figure 7.4: The soccer player showing (a) the simulated robot leg positions, (b) 400 discrete ball locations on the field, (c) the discounted value of states in the level 2 no-exit sub-MDP when the robot is rewarded for running around the ball and (d) a snapshot of the robot running around the ball.
Using a learning rate of 0.25, a discount rate of 0.99 and an ε-greedy exploration
policy for the top level sub-MDP with ε = .5 for the first 500 trials, HEXQ generates
a total of 17 sub-MDPs to decompose the problem. At level 1, 6 sub-MDPs deter-
mine the one step walk directions plus one no-exit sub-MDP. At level 2, unlike in
section 6.4, there are two regions, one with 860 states and the other with one state
when the agent is at the ball location. The first region has 7 sub-MDPs representing
the 6 directions from which to approach the ball and one no-exit MDP. The single
state region has two vestigial sub-MDPs, one allowing abstract actions to kick in
one of six directions and a no-exit policy. There are two regions as a direct result
of conservatively taking strongly connected components as regions. They would be
merged into one region with the suggested improvement in the footnote of section
7.2.
After exploring the value function for scoring goals, the agent decides to continue
in the no-exit level 1 sub-MDP running on the spot for case (1). For case (2) the
agent continues in the level 2 no-exit sub-MDP for the larger region, running around
the ball as shown in figure 7.4 (d). In this case it invokes terminating sub-MDPs
from level 1 for locomotion. Figure 7.4 (c) shows the discounted value function for
a portion of the no-exit sub-MDP for the larger region at level 2. Six maximum
value states determine the circling policy. For this sub-MDP policy the sequence of
primitive rewards is 4, -1, -1, -1, 4, -1, -1, -1, 4, -1, .... Executing its stylised walking
gait, the agent makes 4 leg movements to take one step. The reward is 4 when it
changes position relative to the ball and -1 otherwise. With a discount factor of .99
the E value of each state in the final cycle is the discounted sum of the above reward
sequence and is equal to 26.8 truncated to one decimal place.
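As a sanity check on this figure, the cyclic return can be computed directly. The short Python sketch below (purely illustrative, not part of HEXQ) sums the infinitely repeating reward cycle 4, -1, -1, -1 under a discount factor of 0.99:

    gamma = 0.99
    cycle = [4, -1, -1, -1]
    # Value of the infinite repetition of the cycle:
    #   V = sum_i gamma^i * r_i + gamma^len(cycle) * V
    numerator = sum(gamma**i * r for i, r in enumerate(cycle))
    value = numerator / (1 - gamma**len(cycle))
    print(value)  # approximately 26.89, i.e. 26.8 truncated to one decimal place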
HEXQ solves this problem in about a minute, on a 400MHz Intel desktop com-
puter, with theoretically over four orders of magnitude saving in space requirements
as a result of state abstraction made possible with the additional action discount
function. The results support the expected operation of the HEXQ algorithm to au-
tomatically decompose and solve an infinite horizon MDP with positive and negative
rewards.
7.4 Conclusion
The hierarchical reinforcement learners MAXQ and HEXQ employ a hierarchically
decomposed value function. When the value function is represented by the dis-
counted sum of future rewards it is not possible to directly abstract sub-task state
values because the amount of discount to apply to the completion or exit values
is not known. The introduction of a separate on-policy discount function for each
sub-task has solved this problem for HEXQ and conditionally for MAXQ. The in-
troduction of non-exiting sub-MDPs that are interrupted during learning and the
extension of on-policy pseudo reward functions with large positive rewards on exit
has allowed HEXQ to automatically decompose and solve infinite horizon MDPs
with positive rewards.
The discount function gives the amount of discount to apply at termination or
exit for each state in the sub-task. In HEXQ it is possible to safely abstract the exit
values for all states of a sub-MDP because exits are unique.
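In sketch form only (using the notation of equation 8.2 in the next chapter, and glossing over details developed earlier in this chapter), if $D^*_{m^a}(s)$ denotes the on-policy discount accumulated by sub-MDP $m^a$ from state $s$ until exit, the discounted decomposed value function is roughly

\[ V^*_e(s) = \max_a \left[ V^*_{m^a}(s) + D^*_{m^a}(s)\, E^*_e(g^e, a) \right], \]

so that the exit value is discounted by exactly the amount of discount incurred while completing the sub-task.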
A decomposed discounted value function and state abstraction extends the effi-
cient application of this type of hierarchical reinforcement learning from stochastic
shortest path problems to infinite horizon MDPs.
Chapter 8
Approximations
In many practical situations an approximate solution is all that may be required. Ap-
proximations can often reduce computational complexity significantly. This Chapter
reports on preliminary work using approximations in the application of HEXQ to
problems where some additional background knowledge is assumed. This can lead
to savings in storage and computation time over and above that already achieved
by HEXQ, while retaining the ability to automatically construct a hierarchy. In
some cases, approximations can improve the policy found by HEXQ or even make a
problem HEXQ decomposable when the problem was not decomposable beforehand.
Section 4.4 illustrated the compact representation of a HEXQ tree by a directed
acyclic graph using safe state abstractions. The HEXQ value function for a hier-
archical policy is exactly the same before and after state abstraction. This section
introduces approximations to a decomposed value function that go beyond safe state
abstraction of a hierarchically optimal value function. Two types of approximations
are explored in relation to stochastic shortest path MDPs:
• variable resolution approximations,
• stochastic approximations.
8.1 Variable Resolution Approximations
Not only do humans have the ability to form abstractions, but they also control the amount of detail required to represent complex situations and to make decisions. To decide the best way to travel to a conference, for example, the first component of the decision may involve the available modes of transport: car, bus, aircraft or train.
The final decision may even take into consideration connections at either end for each
mode of primary transport. However, the way to leave the home (front door, back
door or garage door) is unlikely to be an influencing factor in the decision, although
the final execution of the plan will use one of the doors to exit the home. Planning
the whole journey to the lowest level of detail is unnecessary and an approximation
limited to the first two levels may be sufficient. How could a reinforcement learner
model these pragmatics?
The HEXQ decomposition produces a hierarchy of abstract models with increas-
ingly finer resolution at successive levels further down the hierarchy. For example,
in the simple maze from Chapters 1 and 4, the top level models the coarser grained
inter-room navigation, whereas the bottom level represents the higher resolution
intra-room navigation from position to position. In the Towers of Hanoi problem,
higher levels in the hierarchy model the movement of larger discs. The details of
moving the smaller stack of discs, that sits on top of the larger discs, is left to the
lower levels in the hierarchy.
A natural approximation at each level of a hierarchy is to ignore the details at
the lower levels in the hierarchy. The depth in the hierarchy to which the details
are taken into account can be varied, providing a mechanism whereby the accuracy
of the approximation can be controlled. A number of heuristics are presented to ap-
proximate the decomposed value function both at construction time of the hierarchy
and during final execution of the hierarchical policy.
8.1.1 Hierarchies of Abstract Models
The HEXQ decomposition produces a hierarchy of abstract models at progressively
finer resolutions.
Consider a HEXQ decomposition to level $e$ with abstract states $s^e \in S^e$. The Cartesian product of $S^e$ and the rest of the frequency ordered variables $X_{e+1}, \ldots, X_d$ from the original MDP forms a partition $G^e$ over the base level states at level $e$:

\[ G^e = S^e \times X_{e+1} \times X_{e+2} \times \ldots \times X_d \qquad (8.1) \]
Under this interpretation, the HEXQ decomposition produces a series of progres-
sively coarser partitions, Ge, at each successive level. Each Ge is a quotient partition
with respect to the next lower level e− 1. The blocks of partition Ge represent the
states of the abstracted SMDP at level e. The HEXQ decomposition of the original
MDP is hence modelled at progressively finer levels of resolution proceeding from
the top to the bottom level of a HEXQ hierarchy.
The HEXQ decomposition provides a structure and mechanism to approximate
solutions to various levels of detail. Recall that the decomposed hierarchically optimal value function for the abstract SMDP at level $e$ (equation 4.15 from Chapter 4), for state $s$ and $g^e \in G^e$, where (abstract) action $a$ invokes sub-MDP $m^a$, is

\[ V^*_e(s) = \max_a \left[ V^*_{m^a}(s) + E^*_e(g^e, a) \right] \qquad (8.2) \]
Before proceeding to discuss a number of variable resolution approximations, an
example is introduced to facilitate the discussion.
Figure 8.1 is the plan view showing two of 10 similar multi-room floors in a
multi-storey building connected by two (East and West) banks of elevators. The
state of this MDP is described by three variables: floors (values 0-9), rooms (0-24)
and positions in room (0-8). An agent uses one step actions to move North, South,
East, West, up or down and starts on floor 0. The goal is to move to position “G” on
floor 1 and execute the up action (to say turn the coffee maker on). The up/down actions only work at the elevator where they move the agent one floor at a time. The reward is −1 for taking each action.

Figure 8.1: The plan view of two out of ten floors connected via two banks of elevators.
The HEXQ hierarchical decomposition for the multi-storey building MDP aggre-
gates room locations into a single state with 8 abstract actions, 4 to leave the room
via the North, South, East or West and 4 to execute up or down at each elevator or
“G” position. The top level SMDP has only 10 abstract states, one for each floor,
and 9 abstract actions, moving to each elevator and pressing up or down and moving
to the coffee machine and pressing up. The storage required to compute the value
function for the original MDP is 13,500 values (9 positions × 25 rooms × 10 floors
× 6 actions). The HEXQ decomposed model requires 1,306 values (calculated using equation 5.7.2 from Chapter 5), an order of magnitude
saving.
Three variable resolution approximations to reduce the computational complexity beyond that of safe state abstraction are introduced. The work here is of a
preliminary nature and approximations are suggested as a future research topic.
8.1.2 Variable Resolution Value Function
The overall value function is composed from the value functions of sub-MDP com-
ponents at each level in the hierarchy. Even after learning is complete, execution of
a hierarchically optimal policy is not just a matter of using a simple lookup table
as for a flat reinforcement learner. HEXQ uses a best first search2 as a result of
the recursion in equation 8.2 to decide the next best abstract action. These deci-
sions are made when invoking sub-MDPs during learning, and after learning on final
execution of the decomposed problem.
Depth bounded search is a well known technique in problem solvers (Russell and
Norvig, 1995). Limiting the depth of the search in this case can approximate the
value function. The approximation is accurate to the extent that either:
• there is a diminishing return (or loss) in the finer detail of sub-tasks and their
contribution to the overall value function can be ignored or
• sub-tasks accumulate near constant internal reward during execution. In this
case the reward on exit can be made proportional to the accumulated reward
estimate within a sub-task plus exit, with the effect that there is no further
contribution to the value function from lower levels.
For example with a depth limit of zero, representing the most abstract approxi-
mation, equation 8.2 becomes
\[ V^*(g) = \max_a E^*(g, a) \qquad (8.3) \]
reducing nicely to the usual “flat” Q learning formulation for the problem. (For HEXQ, with only one variable in the state description of the original MDP, or for the first level of the hierarchy, the exit value function E is identical to the normal Q action value function in reinforcement learning. This consistency is another motivation for the particular choice of the decomposition in the way reward on exit is treated.)
The depth first search in equation 8.2 has a time complexity of $O(\prod_d b_d)$, where $b_d$ is the branching factor or number of abstract actions at each level $d$, down to the leaves of the HEXQ graph. The time complexity for a depth zero search is $O(\sum_d b_d)$, the sum of the number of abstract actions applicable at each level.
Varying the search depth may give better approximations at the expense of a
greater execution cost. By making the value function search depth variable, the
speed of decision making in HEXQ can be offset against the quality of the policy
during execution, providing an anytime execution procedure for HEXQ.
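A minimal sketch of such a depth-bounded evaluation is given below. It is illustrative only and not the thesis implementation; the interface (a node object exposing abstract_state, abstract_actions, sub_mdp and exit_value) is a hypothetical stand-in for the HEXQ data structures.

    def best_abstract_action(node, state, depth_limit):
        # Evaluate V*(s) = max_a [ V*_{m^a}(s) + E*(g, a) ] as in equation 8.2,
        # but only expand invoked sub-MDPs down to depth_limit further levels.
        best_value, best_action = float("-inf"), None
        g = node.abstract_state(state)              # project the state to this level
        for a in node.abstract_actions(g):
            if depth_limit <= 0:
                inner = 0.0                         # depth zero: ignore sub-task detail
            else:
                child = node.sub_mdp(a)             # sub-MDP m^a invoked by action a
                inner, _ = best_abstract_action(child, state, depth_limit - 1)
            value = inner + node.exit_value(g, a)
            if value > best_value:
                best_value, best_action = value, a
        return best_value, best_action

With depth_limit set to zero the recursion reduces to equation 8.3; increasing the limit trades decision time against the accuracy of the comparison between abstract actions.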
It is important to note that limiting the search to a particular depth does not
affect the ability to operate at more detailed levels. Limited depth resolution op-
erates at each level of the HEXQ hierarchy. Recall that in HEXQ the levels are
numbered from the bottom up. For example, at the highest level, level 4, a depth of
2 searches down to level 2 and at level 3 the search extends down to level 1. HEXQ
is still able to implement fine grain policy because when abstract actions at lower
levels are evaluated, they in turn resolve the local problem to their own set depth
limit.
In the multi-storey building example, limiting the value function to top level
values would result in choosing an arbitrary elevator to reach the first floor. In this
case the agent may lengthen the journey by not travelling to the nearest elevator
bank. Increasing the depth of search to one level may result in a shorter path, as
the distance to each elevator bank is included in the decision. For the Towers of
Hanoi puzzle in section 6.3, constraints ensure that the cost of an abstract action
is constant at each level (given the disks are stacked on one peg to this level). This
means that a depth zero search is sufficient to ensure an optimal policy, reducing
the number of values that need to be searched, at the top level in the 7 disk version,
from $6^7 = 279,936$ to only $6 \times 7$. A saving of more than 3 orders of magnitude! This is
a significant reduction in decision time complexity for choosing the next best action.
This means that HEXQ executes the hierarchically optimal policy (and in this case
the globally optimal policy) by simply taking the (abstract) action at each level that
maximises the exit value for that level only. Only seven decisions are required to
reach level one and take a primitive action.
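The saving quoted above can be reproduced with a line of arithmetic. The following snippet (illustrative only) compares the number of values examined by a full depth-first evaluation with a depth zero search, for 7 levels of 6 abstract actions each:

    levels, branching = 7, 6
    full_search = branching ** levels   # product of branching factors: 279,936
    depth_zero = branching * levels     # sum of branching factors: 42
    print(full_search, depth_zero, full_search // depth_zero)  # ratio is about 6,665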
In applying this approximation one needs to be careful. It is easy to construct
problems where the cost to goal is further from optimal when ignoring lower level
contributions. Nevertheless limiting the depth of search for the decomposed value
function is worth considering. This approximation can also be used with MAXQ
and provides one answer to the high cost of the value function depth-first-search
flagged by Dietterich (2000).
8.1.3 Variable Resolution Exit States
The number of region exits greatly affects the efficiency of HEXQ as it determines
the number of sub-MDPs that are generated and the number of abstract actions
that are explored at the next higher level. Savings can be realised by combining
exits that have similar exit states.
The HEXQ decomposition generates one region sub-MDP for each hierarchical
exit state. Recall from section 5.3 that a hierarchical exit state means that the exit
state is at full resolution in that it is not abstracted and therefore distinguished by
all the variable values along the path down to the base level of the hierarchy. This
was necessary to ensure safe state abstraction as the internal value to reach an exit
may vary for different base level states. The resolution of a state refers to the level
in the HEXQ hierarchy at which the state is represented abstractly. Limiting the
resolution of the exit states may reduce the number of sub-MDPs required. The
effect is to reduce the number of separate policies for each region at the expense of
possibly reaching each exit sub-optimally. The benefit is to reduce space complexity
and learning time complexity. The loss of resolving exit states will in general lead to
an overestimation of the value function as some negative rewards within the abstract
exit state may be ignored.
For the multi-storey building example, consider the floor region defined by the
25 abstract room states. Separate sub-MDPs for this region are required for each
of the four elevators, despite each of two elevators sharing the same room. If we
generate only one sub-MDP per elevator bank, that is, per abstract room state, the
storage requirements to represent the value function are reduced from 1306 to 906.
In this example there is no loss in accuracy of the value function, but in general the
loss is limited to the internal difference in state values, at the limit of resolution, in
this case the intra-room distance.
8.1.4 Variable Resolution Exits
In MAXQ, goal termination predicates are defined by the user. Each termination
predicate could combine many exits discovered by HEXQ. As HEXQ automatically
constructs the hierarchy, the question is: is it possible to find criteria for automati-
cally combining exits?
To allow safe state abstraction, HEXQ generates one abstract action for each
region exit. In turn, each abstract action needs to be represented and explored at
the next highest level in the hierarchy.
In flat MDPs, if two actions produce identical state transitions and rewards
from all states, then these actions may be combined into one and the problem
reduced without detracting from the optimal solution. This motivates the heuristic
for HEXQ to combine exits from a region when they always transition to the same
next region.
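A minimal sketch of this grouping heuristic follows. It is not HEXQ code; the exit records (exit identifier, observed next region, observed reward) are hypothetical, and in practice the comparison would be made on estimated transition and reward functions rather than raw observations.

    from collections import defaultdict

    def combinable_exit_groups(observations):
        # observations: iterable of (exit_id, next_region, reward) tuples.
        # Exits whose sets of observed (next_region, reward) outcomes are identical
        # are candidates for being combined into a single abstract action.
        outcomes = defaultdict(set)
        for exit_id, next_region, reward in observations:
            outcomes[exit_id].add((next_region, reward))
        groups = defaultdict(list)
        for exit_id, seen in outcomes.items():
            groups[frozenset(seen)].append(exit_id)
        return [ids for ids in groups.values() if len(ids) > 1]

    # Three doorway exits that always lead to room 7 with reward -1 form one group:
    print(combinable_exit_groups(
        [("d1", 7, -1), ("d2", 7, -1), ("d3", 7, -1), ("e1", 9, -1)]))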
Combining exits makes it easier for some sub-MDPs to exit as the choice of exit
has increased. This will in general increase the internal value function of a sub-MDP
as it is less costly to reach the closest of the combined exits. On the other hand
there is an increased loss of control as exits cannot be discriminated at the next
level. The net effect on the value function and resultant policy depends on each
specific problem instance and needs to be tested. It should be noted that sub-MDP
termination, with combined exits, may no longer be totally independent of the entry
state. Nevertheless, combining exits may still provide a good approximation to the
safe state abstraction produced by HEXQ.
For the multi-storey problem, the two floor exits going up for each elevator bank
may be combined as they always lead to the same next state (the room on the floor
above). Similarly the down exits may be combined. The effect of combining elevator
exits is to reduce the storage space requirements from 1306 to 866.
As another illustration, consider the maze introduced in Chapter 1, this time
with the wider doorways shown in figure 8.2. With deterministic actions HEXQ
would generate 3 exits per doorway and 12 exits per room altogether. This would
result in 12 sub-MDPs and 12 abstract actions at the room level of abstraction.
If the problem is modelled at the resolution of the rooms only, the three exits per
doorway would be indistinguishable. Each would transition between similar rooms
with the same reward. Using the above heuristic, these exits could be combined into
one.
Figure 8.2: The simple room example with wider doorways.
Since only one sub-MDP would now be generated per doorway, the agent, having
decided on the doorway to use, would leave the room by the nearest of the three
exits that were combined to form the one doorway exit. If the value function is
evaluated to full resolution then safe state abstraction could no longer be guaranteed
because the E action value function on exit would now depend on the agent’s starting
position in the room. An average E value for the three exits would result based on
the frequency that each type of exit was experienced.
Combining exits with similar abstract state transitions and reward functions
requires that HEXQ first identifies the exits and explores them at the next level
before they can be combined in the previous level. This means that HEXQ needs to
backtrack during hierarchy construction making the algorithm more complex. Back-
tracking may well become a feature of hierarchy construction as will be suggested
in section 9.6 to achieve more flexible and robust learning.
The idea can again be generalised to a variable resolution of abstract states.
Exits are combined if they have identical transition and reward functions for all
hierarchical exit states down to the same level of resolution.
There are other conditions for combining exits that suggest themselves. For
example, if the exit values $E$ for a set of exits are the same for any abstract state $s^e$, i.e. $\forall\, i, j, k : E(s^e_i, a^e_j) = E(s^e_i, a^e_k)$, then the exits can be combined and treated as
one abstract action requiring only one sub-MDP. This type of combination will still
represent the value function losslessly and may improve the hierarchically optimal
solution. The condition that E values are the same may be generalisable to a small
difference between E values. Approximations to these conditions would also be a
good heuristic for combination exits.
A potential application of exit combination is the Ball Kicking problem from
Chapter 6, extended with another variable that determines the goal direction. Ab-
stract actions that reliably kick into one of the goals could be combined. With the
ball location variable specified relative to the target goal, the agent would then not
have to learn to kick to each end of the field separately.
8.2 Stochastic Approximations
There are stochastic MDPs that look like they should decompose easily, but HEXQ
is unable to decompose them because of the strict requirement that the decomposed
value function is lossless. An example is the simple maze from Chapter 1, when the
actions are stochastic in the sense that say 70% of the time they move the agent in the
intended direction and 10% of the time in each of the three other directions. HEXQ
cannot decompose this problem in any useful way, as exits through a particular
doorway cannot be guaranteed and the reachability condition of the HEXQ partition
is violated.
It is possible for some problems to use a deterministic MDP policy to approximate
a stochastic MDP policy, if environments can be identified where the stochasticity
does not cause undue adverse effects. Some stochastic problems that produce poor
automatic decompositions may produce better decompositions with deterministic
actions. This means that approximate, but compact, solutions can be found with
HEXQ to these problems.
An example of a benign environment is a navigation problem where the error
distribution following each action is symmetrically distributed about the direction
of travel, where it is always possible to reverse the direction, and where the reward is constant per
transition. A useful future research topic would be to formally state conditions and
possibly error bounds for MDPs, to approximate a stochastic MDP policy with a
deterministic MDP policy. Liptser et al. (1996), for example, show that for certain
continuous problems, when the noise “intensity” is small and the fluctuations become
fast, the stochastic problem can be approximated by a deterministic one.
Without a model, the deterministic approximation to stochastic actions may not
be available. In this case it may be possible to use the transition with the highest
probability or payoff. In Euclidean metric spaces the transition closest to the sample
average of the next states may be a good candidate.
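A minimal sketch of the model-free variant suggested above is given below; it simply replaces each sampled transition distribution by its most frequently observed next state. The data layout is a hypothetical illustration.

    from collections import Counter

    def deterministic_approximation(samples):
        # samples: dict mapping (state, action) to a list of observed next states.
        # Returns a deterministic transition table using the most probable next state.
        return {sa: Counter(next_states).most_common(1)[0][0]
                for sa, next_states in samples.items()}

    # An action that usually moves the agent East but occasionally slips:
    samples = {((2, 3), "East"): [(3, 3), (3, 3), (2, 4), (3, 3), (2, 2)]}
    print(deterministic_approximation(samples))  # {((2, 3), 'East'): (3, 3)}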
Figure 8.3: Kaelbling's 10 × 10 navigation maze. The regions represent a Voronoi partition given the circled landmarks.

When benign conditions prevail, the HEXQ decomposition is particularly encouraging because the decomposition with deterministic actions has been proven to
be globally optimal (section 4.3.1). The deterministic MDP policy, as an approxi-
mation to the stochastic problem, should therefore produce good solutions. In the
worst case, a deterministic MDP policy may be arbitrarily bad or fail to reach an
exit altogether.
As an example, consider a maze task introduced by Kaelbling to show how aiming
for landmarks can reduce the computational complexity of an MDP by sacrificing
global optimality. In the maze (figure 8.3) an agent must learn to navigate from a
random start location to a random goal location by the shortest route on the 10×10
grid. After reaching the goal the problem terminates and a new starting position
and goal are chosen at random. The actions available to the agent are to move one
square to the North, South, East or West. The interesting feature about this maze
is that each location is itself a goal.
To solve this maze it is necessary to solve the equivalent of 100 MDPs, one for
each goal location. The size of the state space is therefore 100 locations ×100 goals
= 10,000 states.
In Kaelbling’s maze the agent is assisted by 12 fixed landmarks (indicated by
circles in figure 8.3) spread arbitrarily throughout the grid. The locations are parti-
tioned into Voronoi regions indicating their closest landmark. In a stochastic setting,
Kaelbling’s error distributions on each action are as follows: the agent lands in the
intended cell 20% of the time and 20% of the time in each of the 4 neighboring
cells. If the actions move the agent outside the boundary, the agent is moved to the
closest internal location.
This task is represented as an MDP to HEXQ, with the states described by a
vector of 3 variables, namely: agent-location (100 values), goal-location (100 values)
and closest-landmark (12 values). A dummy action is introduced to allow the agent
to signal arrival at the goal. Primitive rewards are −1 for each action.
HEXQ cannot decompose the stochastic version of this problem. The reason
is that the reachability condition of the HEXQ partition cannot be satisfied. In
any sub-region of the maze the agent may not reach the intended region exit with
probability one because the stochastic drift may cause it to leave at any boundary
location.
The whole point of Kaelbling’s original paper was to show how, using landmarks,
it is possible to efficiently learn an approximate navigation policy that moves to any goal at a small cost above optimal. The basic idea in HDG is to navigate at the abstract level of landmarks and only make local decisions at the finer, primitive level of resolution.
HEXQ can solve this problem using the approximation discussed above. Having
automatically decomposed and solved the deterministic 10 × 10 maze optimally,
the solution to the stochastic problem is to execute the deterministic MDP optimal
policy using hierarchical greedy execution. A hierarchical policy will lock in subtasks
until they exit. When actions are stochastic it is possible that the agent will drift
to locations where the sub-task policy is no longer appropriate. Hierarchical greedy
execution continuously interrupts each sub-task execution and recalculates the top
level value function to find the next hierarchically optimal primitive action.
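A minimal sketch of hierarchical greedy execution is shown below. It is illustrative only; the hierarchy object and its methods are hypothetical stand-ins for the HEXQ structures.

    def hierarchical_greedy_step(hierarchy, state):
        # Re-evaluate the decomposed value function from the top of the hierarchy and
        # descend greedily until a primitive action is reached (cf. equation 8.2).
        node = hierarchy.top_sub_mdp()
        action = node.best_action(state)
        while not hierarchy.is_primitive(action):
            node = hierarchy.sub_mdp(action)   # sub-MDP invoked by the abstract action
            action = node.best_action(state)
        return action

    def execute(env, hierarchy, state, max_steps=1000):
        # Unlike locking in a sub-task until it exits, the choice is revised every step,
        # so stochastic drift away from a sub-task's region is corrected immediately.
        for _ in range(max_steps):
            state, done = env.step(hierarchical_greedy_step(hierarchy, state))
            if done:
                break
        return state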
Figure 8.4: Performance of HEXQ on the stochastic version of Kaelbling's maze (average steps per trial versus trials, for the flat learner and HEXQ). Steps per trial are collected in buckets of 10 samples over 10 runs. The trend lines are 255 point moving averages.
Figure 8.4 shows the detail of the performance of a flat reinforcement learner and
HEXQ for the stochastic version of the 10× 10 maze with hierarchical execution of
the previously found optimal policy using deterministic actions. HEXQ completes
construction of the hierarchy and finds an optimal policy by about 30,000 trials.
The step up in the graph for HEXQ at 50,000 trials is where stochastic actions
are switched in. The flat RL uses a learning rate of 0.25. From value iteration we
know the optimal performance is 9.24. HEXQ using the stochastic approximation
achieves 10.1 average steps per trial. The flat learner achieves 10.1 after about
200,000 trials and would presumably converge to the optimum with further trials
and with a gradual reduction in the learning rate (see appendix B for further comments).
Stochastic transitions may thus be approximated by deterministic transitions in
some forgiving problems allowing HEXQ to decompose the problem.
8.3 Discussion and Conclusion
As discussed in section 4.3, a hierarchical representation usually constrains the set
of policies available for the original problem. In hierarchical reinforcement learners
such as MAXQ, HAMQ and HEXQ there are no guarantees in general that the
hierarchically optimal policy is close to the global optimal policy. In this sense a hi-
erarchy, whether user defined or generated automatically, generally already provides
only an approximate solution to the original problem.
This Chapter has introduced extensions to the basic HEXQ algorithm that allow
a decomposed value function to be represented approximately. The appropriateness
of each approximation is problem dependent, but the accuracy may be controlled
by varying the resolution of the model. This provides an anytime mechanism for
learning and execution trading off decision quality against time.
Many questions are left unanswered, such as clear statements for the conditions
under which the various approximations will work and precise error bounds. It may
be possible to learn the most appropriate depth of resolution at each level of the
hierarchy. These and other questions are left for future research.
Chapter 9
Future Research
The objective of this thesis has been to discover hierarchical structure in reinforce-
ment learning. The key idea embodied in the HEXQ algorithm is to find reusable
sub-tasks and abstract them to form a reduced model of the original problem that
requires significantly less space and time for its solution.
This Chapter will discuss various directions for future research that builds on this
foundation, removing assumptions, increasing the scaling potential of hierarchical
reinforcement learning and moving towards more complex, realistic environments.
9.1 Discovering Exits
One of the assumptions made by HEXQ is that all exits are discovered at each level
using a random search. For the problems in this thesis the exploration time was
set manually to ensure that this was the case. As explained in section 5.2, missing
important exits may result in a poorer solution or failure to find a solution.
The 25 rooms experiment in section 6.2 highlighted this weakness. The heuristic to discover exits by testing temporally close samples of transitions failed to find all
the exits for some stochastic specifications of the problem. Other statistical tests,
such as the Chi squared test, have been tried, but the performance of the heuristic
depends very much on the transition structure of the problem. If the structure is
such that a temporal sequence of samples under a random action policy moves the
agent to different locations in the state space, then the distribution of the temporal
sample may not be able to discover an exit in a particular part of the state space.
There are several solutions to this issue. Of course, if the original problem is
known to be Markov, then it is tempting to collect sample data for each lower level
transition probability distribution in the context of the remaining variables. These
distributions are certain to be stationary by assumption. It is an easy matter to
test for exits to any level of significance by collecting enough data and comparing
distributions. The problem with this approach is that the number of contexts can be
exponentially large, depending on the variables still to be processed in the ordered
state vector. The “curse of dimensionality” is transferred to the enumeration of the
contexts and this is impractical for large problems.
One solution is to test for unusual circumstances at a lower level of significance
and tag these situations as specific conditions under which to collect more reliable
statistics before declaring exits. In this way only a practical selection of contexts
are tested and the test can be conducted to any level of significance by simply
collecting enough data. Humans seem to use this method. For example, if an
observation is made that seems unusual, attention is focussed on that particular
circumstance until it is decided that it was either a coincidence or a new phenomenon.
Unusual circumstances may, for example, still be identified by testing a temporally
close sample, but this time the lower level of significance of the test will only flag
situations for further testing. The key is not to commit to these potential exits until
they are confirmed by testing in their specific contexts.
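A minimal sketch of this two-stage idea is given below. It is not the HEXQ heuristic itself; the data layout, the lenient threshold of 0.2 and the use of scipy's chi-squared contingency test are illustrative assumptions.

    from collections import Counter
    from scipy.stats import chi2_contingency

    def flag_candidate_exit(recent_next_states, overall_next_states, alpha=0.2):
        # Compare a temporally close sample of outcomes for a state-action pair with
        # the outcomes observed overall. A lenient alpha only flags the pair for
        # further, context-specific data collection before an exit is declared.
        states = sorted(set(recent_next_states) | set(overall_next_states))
        if len(states) < 2:
            return False
        recent, overall = Counter(recent_next_states), Counter(overall_next_states)
        table = [[recent[s] for s in states], [overall[s] for s in states]]
        _, p_value, _, _ = chi2_contingency(table)
        return p_value < alpha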
A further extension is to be able to add exits belatedly. Some exits may be hard
to find or may not be needed to solve the problem. If exits could be accommodated
at any time during hierarchy construction, this would assist in two ways:
• The HEXQ time complexity can be reduced by searching for exits concurrently
with region discovery at the next higher level.
• An impasse or aberration discovered at a higher level due to a missed exit at
a lower level can be repaired.
More recently Potts and Hengst (2004) have shown how learning can be speeded
up further by allowing exits to be concurrently discovered and assimilated at multiple
levels.
9.2 Stochasticity at Region Boundaries
HEXQ will only find meaningful decompositions if regions can be found such that
their exit states can be reached with certainty. Even when this is the case, the
constraint on policies inherent in a HEXQ hierarchy means that low cost region
leaving policies may be missed, as was illustrated in section 4.3. This raises the
issue of whether the stochasticity can be better handled at region boundaries.
One answer is to combine exits as suggested in section 8.1.4. While this may
improve the value function locally for a reusable subtask, it gives up the opportunity
to choose a specific exit in each particular instantiation of the subtask. In HEXQ it
also introduces an approximation for the hierarchical policy, as the exit transition
may no longer be independent of the way the region was entered. Having said that,
the overall solution may improve depending on the tradeoff in the ease of exiting
against the precision of exit control required. For example, exiting a multiple state
wide doorway may be easier when the exit states are combined, but if exiting other
than via the central state leads to injury then the combined exit is not such a good
idea. This tradeoff is problem dependent and may be able to be learnt.
Another interesting idea is to explore the general possibility that stochasticity
at lower levels in the hierarchy can be contained and controlled when constructing
models at higher levels. If a robot navigates through rooms, its stochastic actions
may cause it to slip. Sensor readings, indicating its location, will have errors. The
probability of occupying a particular room, however, is less error prone. Finney,
Hernandez, Oates, and Kaelbling (2001) suggested similar research objectives.
Other related work includes the findings by Lane and Kaelbling (2002) in which care-
fully constructed macros “hide” navigational stochasticity to produce a deterministic
planning problem.
In contrast to the partitioning employed by HEXQ, Moore et al. (1999) use
overlapping regions in the “Airport Hierarchy” to decompose the navigation space
(see section 3.5.5). This suggests another approach to automatically decomposing
MDPs to contain stochasticity. The abstract model of moving between overlapping
regions will exhibit less stochasticity than that for the base level states. If repeat-
able overlapping Markov sub-space regions can be found, then transitions between
regions may be approximately deterministic. In the multi-room problem, section
6.2, for example, the doorways between rooms are similar. A region defined around
a doorway would allow stochastic exits to be modelled and contained, avoiding the
issue of combining exits. The agent navigates with more certainty from the center
of the room to the doorway with control passed from region to region in the middle
of their overlap.
9.3 Multiple Simultaneous Regions
HEXQ orders the variables of the MDP by frequency of change. This ordering is an
effective but crude heuristic to construct a hierarchy. The heuristic hopes that it can
uncover variables that have different temporal frequencies of response. Each level
in the HEXQ hierarchy is associated with one variable producing a “monolithic”
hierarchical structure.
A more efficient decomposition may be to have HEXQ find separate partitions
for all the remaining variables at each level in the hierarchy. In this case, the same
exploration effort may uncover multiple abstractions concurrently.
Figure 9.1: A simple navigation problem to leave a room. The agent uses one step moves in 4 compass directions to reach the goal state from any starting position.

Consider the simple room navigation problem in figure 9.1. In this episodic
problem there are 25 states, each labelled with their x and y coordinates. The agent
can deterministically move one step North, South, East and West. It receives a
reward of -1 per step. The objective is to reach the goal, whereupon the problem
terminates and the agent is restarted at random somewhere on the grid.
Ordering the x and y variables in order of frequency, HEXQ will choose one
randomly above the other. If the first variable chosen is y, then HEXQ will find
one region at level 1 in the hierarchy containing the 5 states defined by the values
of the y variable. The actions North and South allow navigation within the region
and East-West actions are exit actions. Level 2 in the HEXQ hierarchy will have 5
abstract states and the problem will be solved by navigating to the abstract state
associated with x = 4 and taking abstract action (y = 4, east).
Searching for Markov sub-space regions for each of the two variables simultane-
ously will produce two different partitions of the state space, one for each variable,
as depicted in figure 9.2. At level 2 in the hierarchy, abstract states are defined
by the Cartesian product of the two region labels. Exits are tagged with the other
variables they change. Combining regions at the next level, the only exits retained
are ones with the same exit action that change the same variables. Other exits can be discarded, as they are absorbed within the other region. Combination produces only one abstract state and one compound abstract action (x = 4, y = 4, East), at level 2, leading to the goal. The others cancel out. The execution of this type of compound abstract action would move x to position 4, y to position 4 and perform primitive action East.

Figure 9.2: New hierarchical structure for the simple navigation problem using multiple partitions at level 1.
This simple problem shows the extra compaction potential from multiple par-
titions. The flat problem requires 25 states × 4 actions = 100 Q values. HEXQ
requires 5 states × 4 actions × 5 exit states + 5 states × 11 exits = 155 E values.
Using simultaneous regions would require 5 states × 4 actions × 2 regions + 1 state
× 1 exit = 41 E values.
9.4 Multi-dimensional Actions
Complex actions may be represented in factored form. Multi-dimensional actions
are interpreted as concurrent along each dimension. For example, speaking, walking
and head movements may be represented by three action variables. The action space
is defined by the Cartesian product of each of the variables.
Decomposing a problem by factoring over states and actions simultaneously re-
sults in parallel hierarchical decompositions. Parallel decomposition refers to an
MDP broken down into a set of sub-MDPs that are “run in parallel” (Boutilier
et al., 1999). Factoring over actions alone has been considered by Pineau et al.
(2001).
The example from the above section can be used to illustrate the point. If
the actions are 2-dimensional with one action variable having values {North, South,
Stay} and another with values {East, West and Stay}, it would be possible to take
simultaneous actions, such as (North, West) or (Stay, East).
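As a small illustration (not from the thesis), the joint action space is just the Cartesian product of the two action variables:

    from itertools import product

    ns_actions = ["North", "South", "Stay"]
    ew_actions = ["East", "West", "Stay"]
    joint_actions = list(product(ns_actions, ew_actions))
    print(len(joint_actions))  # 9 joint actions, e.g. ('North', 'West') or ('Stay', 'East')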
Unfortunately, the problem does not decompose, because moving in two direc-
tions simultaneously changes both the x and y variables. However, applying the
action variables separately for each state variable discovers the same decomposition
as in figure 9.2. In this case the North-South-Stay actions are associated with the
y state variable and the East-West-Stay actions with the x variable. The hierar-
chy is now interpreted to be able to execute both regions simultaneously, effectively
moving the agent diagonally towards the goal. It is unclear how a value function
would be decomposed for these parallel hierarchical decompositions if rewards are
not independent and cannot just be added.
9.5 Default Hierarchies
The declaration of an exit by HEXQ indicates that a transition has been found at a
particular level that cannot be explained (represented by a stationary probability).
When HEXQ discovers a new exit it generates a new abstract action. The abstract
action is explored for all values of the next state variable. It can often happen that
the exit represents an unusual situation that is unlikely to repeat.
Figure 5.3 in Chapter 5 gives an example. The corner obstacle may only appear in
one room, yet, once exits are discovered as a result of the obstacle, it is hypothesised
that they might occur and are tested in all the other rooms.
If these rare situations can be detected, it would be possible to isolate them
to particular contexts and learn sub-MDPs separately for those contexts only. For
the above example, a new set of sub-MDPs would be learnt to exit the room con-
taining the obstacle, but no abstract actions associated with the obstacle would be
generated. The hierarchy would then be required to learn the default condition for
switching in the obstacle-in-room policy cache.
The idea of default hierarchies comes from Holland (1995) where they play a
similar role in extending models by noting rule exceptions to more general cases.
9.6 Dynamic Hierarchies
There are many reasons why initial hierarchical models may require adjustment or in
fact major revision as the agent learns. An agent may experience atypical situations
early in its life, be misled by a noisy environment, train on poor concepts, or be
shown a better policy.
An agent should be able to discard exits that have low utility or incorporate
newly discovered ones. For these cases it is necessary to backtrack levels in hierarchy
construction. To implement this flexibility in HEXQ, it is necessary to continue to
monitor transitions at lower levels. When new exits are discovered, additional sub-
MDPs and abstract actions are introduced. These require exploration at higher
levels and may result in the revision of partitions. When rare exits are discovered it
may not be worth revising a whole body of previously found knowledge but rather to
use default hierarchies, discussed previously, to isolate the exceptional circumstances
in which the exit occurred. If enough evidence accumulates, a more drastic revision
may be necessary. Practice may uncover shortcuts or not require some previously
generated abstract actions. In this case policies should be allowed to atrophy and
release their storage requirements for reuse in other areas.
An issue worth exploring further, is the efficiency of concurrent layered learning,
in which lower levels are allowed to keep learning while simultaneously training
higher levels (Whiteson and Stone, 2003). This is a feature of MAXQ. HEXQ
continues to refine the lower level policies while constructing the higher levels, but
not before all the exits are deemed to have been found.
Concurrent HEXQ (Potts and Hengst, 2004) is a step towards making the hier-
archy dynamic. It not only learns concurrently at multiple levels but also constructs
the hierarchy concurrently at multiple levels. Results for one example show concur-
rent HEXQ learning more than an order of magnitude faster than HEXQ which has
itself reduced the learning time over a flat learner multiple times.
Dynamic hierarchies are also envisaged to allow flexible adaption to changing
circumstances. When an agent is placed in a new environment it would be desirable
for it to adapt and not relearn every sub-task. In this case it will need the ability
to retain some concepts while modifying others.
It is likely that in time, for an agent working in the one environment, base level
concepts will not change and become crystalised as higher level concepts are built
on their foundation. The process of crystallisation is consistent with Utgoff’s notion
of the “frontier of receptivity” (2002).
9.7 Selective Perception and Hidden State
Future research discussed so far is designed to improve the HEXQ decomposition
in various ways, predicated on the assumption that the problem is specified as an
MDP. The next sections will propose future research that does not necessarily make
the assumption the state vector and actions are specified so as to give rise to Markov
transition and reward functions in a straightforward way.
HEXQ relies on an MDP specification as a starting point. One future research
direction is to attempt the automatic decomposition of a sensor state that may con-
tain redundant variables or one that does not exhibit the Markov property. Figure
9.3 is used to illustrate the challenges and possible approaches.
Figure 9.3: A partially observable environment with two independent tasks and 2-dimensional actions. Sensor = (x, y, z); Effector = ({E, W, Stay}, {N, S, Stay}).
This example shows two rooms of the type in figure 9.1 joined via a doorway.
The position in each room is described by the (x, y) coordinate, except that the x
variable values repeat in each room making the environment partially observable.
The agent cannot distinguish its unique position from the x and y variable labels
alone. The actions are factored by North-South and East-West variables making it
possible, for example, to move North and West simultaneously. It is also possible
for the agent to stand still, generating a total of 9 actions: North-Stay, North-East,
Stay-East, South-East, South-Stay, South-West, Stay-West, North-West, Stay-Stay.
Another variable, z, changes value cyclically, irrespective of the action taken, except
that, when at location z = 3, taking action Stay-East terminates the task. The
task is also terminated by leaving the right room at the top right hand corner as
illustrated. All steps incur a cost of -1 and the rewards on termination are fixed at
a positive value. The agent is replaced at random after termination.
It is anticipated that an automatic decomposition could find the hierarchical
structure illustrated in figure 9.4. Each variable's discovered Markov region is shown at the bottom of the figure. As the z region alone can reach termination reliably there is no need to combine it with the x and y regions.

Figure 9.4: The expected result of an automatic decomposition of the problem in figure 9.3.
Regions x and y cannot reach the other goal reliably. Combining them as in
section 9.3 helps, but this is not the whole solution because of state aliasing. By
generating history variables from given variables it is possible to uncover hidden
state. The idea is to simultaneously use selective perception and uncover hidden
state to achieve the task at hand as in McCallum’s (1995) UTree approach. An
advantage in a hierarchical structure is that history at more abstract levels can reach
back many primitive time steps as shown by Hernandez and Mahadevan (2000).
Generating memory at the abstract level allows the rooms to be disambiguated
based only on the two abstract actions to leave the abstract room region. It avoids
having to solve the much harder hidden state problem at the primitive level which
would require, at minimum, a history going back 5 time steps.
The characteristics of the idea suggested here are that the hierarchy is still dis-
covered automatically, state abstraction is used as in HEXQ, redundant variables
are eliminated and variables at higher levels in the hierarchy may even be invented.
An example of the latter characteristic is the formation of the room variable in
the above problem. Rooms are not represented at all in the original problem but
arose out of an aliased abstract state formed by combining two regions.
9.8 Deictic Representations
In deictic representations variables define a situation relative to an agent. This
allows many variables or actions that would otherwise be defined with absolute
reference to be represented relatively and compactly. In natural language, examples
of deictic expressions are “my car”, “the ball in front of me” and “the top disk
on peg 2”. One interpretation of HEXQ regions as a class is that they are deictic
expressions in that the higher level variables index their instantiations. For example,
the robot-location-relative-to-the-ball variable is a deictic representation in the ball
kicking example of section 6.4. The ball is represented by this variable in a state
relative to the robot, irrespective of its position on the field. For the Towers of
Hanoi example, the actions, movexy, are an example of deictic actions, as the disk
moved is implicitly the one on top, but could be any particular disk. This is one of
the reasons HEXQ is able to generalise.
An interesting research direction would be to learn deictic representations if the
state or action variables provided are object specific and weakly coupled. Two
subsystems are weakly coupled when their influence on each other is at longer time
scales than interactions inside each subsystem, or when they are coupled by a small
subset of variables. The total system may be described by a vector of variables, but
it is possible to take various subsets of variables and construct self contained models.
The multiple partitioning approach suggested above can therefore be seen as a way
to exploit decoupled or weakly coupled systems. In these systems there may be a
large number of variables (where many may be dependent on just a few elementary
ones). This could help to make the task of learning in complex problems tractable.
A robotic example is the ball collection challenge in the Sony legged league at
the 2002 RoboCup competition in Fukuoka, Japan (Olave et al., 2003). Two robots
were required to shoot 10 randomly placed balls into either goal in under 3 minutes.
A total environment description would include the position of each of the individual
balls. Of course the task of shooting a particular ball into a goal is only weakly
coupled to the task of shooting any other ball into the goal. The weak coupling is
manifest in that only occasionally another ball or robot presents an obstacle and
that there may be a best order in which to shoot the balls. By programming each
robot to only see its closest ball and to shoot this ball into the goal, the robots were
able to operate successfully by working on a much smaller sub-model of the total
system.
In the example in figure 9.4, a top level focus switching task could be generated to
switch between the two sub-problems. The deictic effect here is to switch attention
by “pointing” to either sub-problem and ignoring the other variables.
The research issues these types of situations present are also raised by Finney
et al. (2001). They suggest once again looking at blocks world scenarios. The
blocks world problem cannot be decomposed by HEXQ and Knoblock (1994) could
not automatically decompose the blocks world problem with ALPINE.
It would be interesting to see if problems such as the ball collection and the blocks
world could be automatically decomposed with the added machinery of hierarchical
state and action abstraction using the multi-partitioning suggested above. Despite
the negative results reported by Finney et al. (2002), introducing deictic variables in
a hierarchical structure appears not only to present compaction opportunities but
seems to account for the way humans abstract (Ballard et al., 1996).
When the interaction amongst problem variables increases, even humans have
difficulty solving these problems. This may be the reason why humans find games
like chess and Rubik’s Cube challenging. It might be instructive to direct research
at the more mundane problems of learning everyday tasks such as cooking or simple
parts assembly. This should allow the study of the underlying concept forming
mechanisms before these algorithms are stress tested on hard problems.
9.9 Quantitative Variables
Variable values in this thesis can be interpreted to reflect the arbitrary states of
an agent’s sensor. HEXQ therefore only assumes that the variables are qualitative
(see section 4.1). This does not preclude the possibility that variables represent
quantitative values, but the algorithm is not predicated on this possible inductive
bias.
The general philosophic approach is that if variables do have a natural order or
describe a metric space, then this needs to be learnt by the agent.
It would be worthwhile to extend HEXQ by assuming quantitative variables.
Various dimensionality reduction techniques, such as linear combination, may then
possibly be applied to a larger set of variables as a preprocessing step to HEXQ.
HEXQ might also be combined with methods such as principal component analysis,
independent component analysis or kernelised versions of these methods.
If the quantitative variables are continuous, then it may be possible to define
mappings into a discrete multi-variate state space that lends itself to HEXQ de-
composition. For example, it is possible to decompose the pole and cart problem
(Anderson, 1986) into two hierarchical levels. At the first level the angle and angular
velocity variables can define balancing subtasks with various accelerations to the left
and right. At the second level, the translational distance and velocity can be learnt
by switching between subtasks.
The learning of the quantitative nature of variables (both discrete and contin-
uous) and their exploitation in terms of hierarchical decomposition is beyond the
scope of this thesis and is left for future research.
9.10 Average Reward HEXQ
The use of average reward or discounted sum of future rewards optimality criteria
only becomes an issue when problems are infinite horizon. Chapter 7 shows how
a dual set of value functions can decompose the discounted sum of future rewards
over a task hierarchy.
There may be advantages in using average reward reinforcement learning with
task hierarchies (Ghavamzadeh and Mahadevan, 2002). The authors make the incor-
rect assumption that, given a stationary policy, the average reward for each subtask
is constant. However, the average reward for a subtask can vary depending on how
the task was initiated, and in general reusable subtasks will be initiated differently
depending on their context.
It is believed to be possible to define a dual value function decomposition for
average reward hierarchical reinforcement learning in the HEXQ setting along sim-
ilar lines to the decomposition of the sum of discounted future rewards. The two
functions for average reward decomposition are the average number of time-steps to
exit and the average reward to exit from any state in the subtask.
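As a rough, unverified sketch of how the two functions might combine, let $T^\pi_m(s)$ be the expected number of time steps to exit sub-MDP $m$ from state $s$ under policy $\pi$, and $R^\pi_m(s)$ the expected reward accumulated until exit. The gain contributed by repeatedly invoking the subtask from entry state $s$ would then be approximately

\[ \rho^\pi_m(s) \approx \frac{R^\pi_m(s)}{T^\pi_m(s)}, \]

which makes explicit that the average reward of a reusable subtask depends on its entry state, and hence on its context, rather than being constant.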
The details and confirmation of the formulation will be left to future work.
9.11 Deliberation
Planning and reinforcement learning are closely related. Sutton and Barto (1998)
explain how both planning and learning can be integrated by allowing both methods
to update the same value function. A planning process is formulated as search over
a state space generated from a model. A model (a state transition function) can be
learnt from the exploration efforts of a reinforcement learner. The model can then
be used for planning.
We have already seen hierarchies crafted by combining both reinforcement learn-
ing and planning (Ryan, 2002). Is it possible to discover abstractions such as rela-
tional descriptions of planning operators automatically? Some work in this area has
already started (Theocharous and Kaelbling, 2004). The automatic discovery of hi-
erarchy with HEXQ may be extended to re-representing attribute value descriptions
in relational terms and allowing hierarchical reinforcement learning to be seamlessly
combined with planning algorithms.
9.12 Training
Even though hierarchical representations can compress complex system behaviour,
finding the appropriate building blocks or intermediate subgoals may be very ex-
pensive if an agent is left to its own devices. While much of the required bias is
expensive for the agent designer to provide beforehand, or indeed unknown, the
agent can be designed to take advantage of an environment where its experiences
are structured by a trainer to assist the process of concept formation. This would
allow an accelerated progression of the frontier of receptivity (Utgoff and Stracuzzi,
2002).
If the agent is autonomous and any interaction can only come via its sensors
and effectors, how is it trained? Even if it is possible, in principle, to change the
“program” of such a learning agent, it may be too complex or labor intensive for a
programmer to understand the self developed internal representations. One solution
is to develop training programs and curricula in which the training is delivered via
the sense-act loop of the agent. A future research direction is to find ways to design
software that can benefit from a trainer via the base level sense-act loop.
A basic approach may be for the trainer to place the agent in that part of the
environment where the agent is likely to have the most fruitful experience. This is a
variation on the idea of the trainer controlling the reward structure. When subgoals
are crystallised, the intermediate reward can be removed and training can take place
at more abstract levels. This basic form of training would not require any special
representation of the training situation by the agent.
A more advanced solution is where the agent is able to copy the trainer’s actions.
This requires the agent to see itself in the trainer’s shoes, so to speak. It seems
monkeys can do this (hence one of the meanings of the word ape), but dogs cannot.
At the highest levels of abstraction, languages, including scientific languages of
mathematics and logic, provide the medium for rapid training in humans. Machines
would most likely need to learn representations that correspond to human under-
standable concepts so that human computer communication can take place. This is
likely for two reasons: (1) when both machines and humans learn in a similar envi-
ronment both are confronted with similar constraints shaping their concepts and (2)
if concept communication takes place throughout a learning lifetime, the concept
hierarchy that is crystallised is likely to be similar between the machine and the
human. As the next level of learning builds on the previous, having a common base
of concepts should make training easier.
9.13 Scaling to Greater Heights
The set of features describing the environment is assumed to be provided to HEXQ.
Where do these features come from?
For an autonomous system, at the lowest level of interaction the features are the
raw sensor inputs and the actions the raw effector commands. The vision sensor
and effectors for a Sony ERS-210 AIBO robot, shown in figure 9.5, make the point.
Its head and legs have 15 degrees of freedom and can be independently activated
to start to move to any target position specified in micro-radians at 8 microsecond
intervals (in time) using PID control. The vision sensor has 25,344 pixels each with
one of 16,777,216 possible colour values streaming in at 25 times a second with the
potential to generate a very large state space. This state space is both highly aliased
and redundant. An open question is how to learn to compress it to manageable size
in the pursuit of higher level objectives such as winning at soccer.
Figure 9.5: The Sony AIBO robot showing an image from its colour camera.
There is a pressing need to tackle these larger scaling issues. Future research is
suggested to use the hierarchical principles outlined in this thesis together with the
right inductive bias for an agent to learn multi-levels of description. The objective
is to bridge the semantic gap between the raw sensor and effector interaction and
higher level concepts to achieve complex tasks such as playing soccer.
An outline of how hierarchic concepts might develop is as follows: an agent
is assumed to have learnt (through evolution) that the visual state space can be
abstracted to edges at various orientations. There is evidence that these features
are present in the occipital lobe (Hubel and Wiesel, 1979). It may also detect blobs
of colour or texture. At the next level of abstraction these features may be able
to form regions, corresponding to objects such as balls, goals, blocks, etc. Further
abstraction may find regions that describe the dynamic interactions between the
agent and the object, as for example in the stylised soccer player, in chapter 6, that
learns to kick the ball. While this is speculative, it would be interesting to see if
these types of hierarchies can be learnt in sophisticated machines.
9.14 Conclusion
This chapter has discussed a number of potential research directions to build on the
ideas introduced with HEXQ. The major topics and directions are:
• Extending automatic MDP decompositions with multiple partitions and multi-
dimensional actions.
• Relaxation of the initial MDP assumption, anticipating hidden state and using
selective perception.
• Developing more robust and flexible dynamic hierarchies.
• Creating new representations from base level variables that are deictic or re-
lational.
• Scaling to realistic autonomous robotic applications.
Chapter 10
Conclusion
This thesis has started to tackle the open problem of automatically decomposing
multi-dimensional reinforcement learning problems. It has presented decomposition
conditions and an algorithm that have met the objective of reducing the computa-
tional complexity automatically. The empirical results support both the theory and
the successful operation of the HEXQ algorithm.
The invariant decomposition approach employed by HEXQ is only applicable to
finite MDPs and its effectiveness depends heavily on the particular representation
employed. It relies on variables having different time scales, variables being con-
sistently labelled and a constrained type of stochasticity. It is not a requirement
that the user knows whether the problem is structured in this way. The HEXQ
decomposition will succeed to the extent that these constraints are present.
One of the great advantages of the decomposition is the ability to generalise to
unseen situations. The HEXQ decomposition is predicated on a multi-dimensional
description of the state space.
The next sections will summarise the main ideas and some potential research
directions.
10.1 Summary of Main Ideas
Markov Subspaces. The key idea in this thesis is to uncover invariance in the form
of subregions of the state space that can then be abstracted. In both cases a
model of a part of the world is tested for repeatability in various contexts and
cached away for reuse. In HEXQ the (Markov equivalent) regions are modelled
as invariant sub-MDPs in the context of all the other variables. These regions
have the property that they can be state abstracted and their policies
treated as abstract actions in a reduced (abstract) semi-MDP.
Optimality of HEXQ. To make hierarchical learning tractable, the cache of poli-
cies is usually constrained over each region. In HEXQ, only policies leading to
each hierarchical exit state are cached. Unfortunately, this constraint on the
policies means that it is now impossible to make any optimality guarantees.
There are currently no known methods for constraining subregion policies and
guaranteeing a globally optimal solution in general. However, for determinis-
tic shortest path problems, or problems in which only the top level sub-MDP
is stochastic, the HEXQ constraints are proven to provide a globally opti-
mum solution. For stochastic shortest path problems HEXQ is hierarchically
optimal.
State Abstraction, Discounting and Infinite Horizon MDPs. The introduc-
tion of an additional decomposed discount function allows safe abstraction
when the discount factor is less than one. This dual function decomposition
makes it possible to hierarchically decompose and compact infinite horizon
MDPs where a solution may require the agent to continue in a sub-task for-
ever.
Abstract Q Functions. The HEXQ decomposition uses a value function that is
decomposed over various levels of resolution in the hierarchy. An important
aspect of the formulation of the decomposition is to treat the primitive reward
on exit of a sub-MDP as belonging to the abstract state transition and not
as a part of the internal cost to reach the exit. The HEXQ decomposition
equations reduce nicely to the normal Q learning equations if there is only one
variable in the state description or when the problem is approximated at a
particular level of resolution. The HEXQ action value function E is seen to be
the abstract generalisation of the normal Q function.
HEXQ was demonstrated on a range of finite state problems. The empirical
results have confirmed the theory and algorithmic integrity of HEXQ. They have
also highlighted some of the finer points and exposed limitations that require further
research.
10.2 Directions for Future Research
A number of research themes have emerged from this early work:
Approximation. The HEXQ decomposition represents a hierarchy of abstract
models. This structure suggests using variable resolution to make approxima-
tions when constructing and solving the HEXQ hierarchy. While the accuracy
of the variable resolution depends on the problem characteristics, it is possible
to save on storage, learning time and execution time by limiting the number
of levels that are taken into account.
General Hierarchies. One of the compelling future directions of this line of re-
search is to construct more general hierarchies by searching for abstractions
over sub-sets of variables simultaneously. This retains the key idea of finding
invariant sub-spaces, but avoids the need to have one order for the variables. It
is envisaged that both selective perception and uncovering hidden state along
the lines of UTree (McCallum, 1995) can extend the automatic discovery of
hierarchical structure.
Training instead of Programming. Relying on trial and error self exploration
alone is likely to be insufficient to allow an agent to learn effectively in complex
environments. The solution will most likely involve a teacher in the agent’s
environment who can structure the agent’s experience into a curriculum to
best assist the formation of a useful abstraction hierarchy. For this exercise to
be more than just training by varying the reinforcement signal, the agent will
need to have the sophistication to picture itself in the shoes of the trainer to
clone behaviour and to share communicable abstract concepts.
10.3 Significance
It seems crucial that effective intelligent agents should be able to view complex
environments at different levels of abstraction, decompose, focus and re-represent
problems.
This thesis has made a contribution towards this longer term objective. Within
the limited scope of a multi-dimensional reinforcement learning problem, the task
of decomposition, state and action abstraction, sub-goal identification and hierar-
chy construction is automated. Automation means that some large reinforcement
learning problems, that are otherwise intractable, can be solved efficiently, without
a designer having to specify the decomposition manually.
The belief is that the discovery and manipulation of hierarchical representations
will prove essential for lifelong learning in autonomous goal directed agents.
Bibliography
Saul Amarel. On representations of problems of reasoning about actions. In Donald
Michie, editor, Machine Intelligence, volume 3, pages 131–171, Edinburgh, 1968.
Edinburgh at the University Press.
Charles W. Anderson. Strategy learning with multilayer connectionist representa-
tions. In Proceedings of the Fourth International Workshop on Machine Learning,
pages 103–114, San Mateo, CA, 1986. Morgan Kaufmann.
David Andre and Stuart J. Russell. State abstraction for programmable reinforce-
ment learning agents. In Rina Dechter, Michael Kearns, and Rich Sutton, editors,
Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages
119–125. AAAI Press, 2002.
Ross Ashby. Design for a Brain: The Origin of Adaptive Behaviour. Chapman &
Hall, London, 1952.
Ross Ashby. Introduction to Cybernetics. Chapman & Hall, London, 1956.
Dana H. Ballard, Mary M. Hayhoe, Polly K. Pook, and Rajesh P. N. Rao. Deictic
codes for the embodiment of cognition. Technical Report NRL95.1, National
Resource Lab. for the Study of Brain and Behavior, U. Rochester, 1996.
Richard Bellman. Adaptive Control Processes: A Guided Tour. Princeton University
Press, Princeton, NJ, 1961.
Dimitri P. Bertsekas and John N. Tsitsiklis. An analysis of stochastic shortest path
problems. Mathematics of Operations Research, 16:580–595, 1991.
Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Struc-
tural assumptions and computational leverage. Journal of Artificial Intelligence
Research, 11:1–94, 1999.
Craig Boutilier and Richard Dearden. Using abstractions for decision-theoretic plan-
ning with time constraints. In Proceedings of the Twelfth National Conference on
Artificial Intelligence (AAAI-94), volume 2, pages 1016–1022, Seattle, Washing-
ton, USA, 1994. AAAI Press/MIT Press.
Rodney A. Brooks. Elephants don’t play chess. Robotics and Autonomous Systems
6, pages 3–15, 1990.
Andy Clark and Chris Thornton. Trading spaces: Computation, representation, and
the limits of uninformed learning. Behavioral and Brain Sciences, 20(1):57–66,
1997.
Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to
Algorithms. MIT Press, Cambridge Massachusetts, 1999.
Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. Advances in
Neural Information Processing Systems 5 (NIPS), 1992.
Edwin D. de Jong and Tim Oates. A coevolutionary approach to representation
development. Proceedings of the ICML-2002 Workshop on Development of Rep-
resentations, pages 1–6, 2002.
Thomas Dean and Robert Givan. Model minimization in Markov decision processes.
In AAAI/IAAI, pages 106–111, 1997.
Thomas Dean and S. H. Lin. Decomposition techniques for planning in stochastic
domains. Technical Report CS-95-10, Department of Computer Science Brown
University, 1995.
Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value
function decomposition. Journal of Artificial Intelligence Research, 13:227–303,
2000.
Thomas G. Dietterich and Ryszard S. Michalski. A comparative review of selected
methods for learning from examples. Machine Learning, pages 41–81, 1984.
Bruce L. Digney. Emergent hierarchical control structures: Learning reactive hier-
archical relationships in reinforcement environments. From Animals to Animats
4: Proceedings of the fourth international conference on simulation of adaptive
behaviour, pages 363–372, 1996.
Bruce L. Digney. Learning hierarchical control structures for multiple tasks and
changing environments. From animals to animats 5: Proceedings of the fifth in-
ternational conference on simulation of adaptive behaviour. SAB 98, 1998.
Chris Drummond. Accelerating reinforcement learning by composing solutions of
automatically identified subtasks. Journal of Artificial Intelligence Research, 16:
59–104, 2002.
Sarah Finney, Natalia H. Gardiol, Leslie Pack Kaelbling, and Tim Oates. The
thing that we tried didn’t work very well: Deictic representation in reinforcement
learning. 18th International Conference on Uncertainty in Artificial Intelligence,
(UAI-02), 2002.
Sarah Finney, Natalia Gardiol Hernandez, Tim Oates, and Leslie Pack Kaelbling.
Learning in worlds with objects. Working Notes of the AAAI Stanford Spring
Symposium on Learning Grounded Representations, 2001.
Mohammad Ghavamzadeh and Sridhar Mahadevan. Hierarchically optimal average
reward reinforcement learning. In Claude Sammut and Achim Hoffmann, edi-
tors, Proceedings of the Nineteenth International Conference on Machine Learn-
ing, pages 195–202. Morgan-Kaufman, 2002.
Michael Bonnell Harries, Claude Sammut, and Kim Horn. Extracting hidden con-
text. Machine Learning, 32(2):101–126, 1998.
J. Hartmanis and R.E. Stearns. Algebraic Structure Theory of Sequential Machines.
Prentice-Hall, 1966.
Milos Hauskrecht, Nicolas Meuleau, Leslie Pack Kaelbling, Thomas Dean, and Craig
Boutilier. Hierarchical solution of Markov decision processes using macro-actions.
In Fourteenth Annual Conference on Uncertainty in Artificial Intelligence, pages
220–229, 1998.
Bernhard Hengst. Generating hierarchical structure in reinforcement learning from
state variables. In PRICAI 2000 Topics in Artificial Intelligence, pages 533–543,
San Francisco, 2000. Springer.
Bernhard Hengst. Discovering hierarchy in reinforcement learning with HEXQ.
In Claude Sammut and Achim Hoffmann, editors, Proceedings of the Nineteenth
International Conference on Machine Learning, pages 243–250. Morgan-Kaufman,
2002.
Bernhard Hengst, Darren Ibbotson, Son Bao Pham, John Dalgliesh, Mike Lawther,
Phil Preston, and Claude Sammut. The UNSW RoboCup 2000 Sony legged league
team. In Peter Stone, Tucker Balch, and Gerhard Kraetzschmar, editors, RoboCup
2000: Robot Soccer World Cup IV, volume 2019 of Lecture Notes in Artificial
Intelligence subseries of Lecture Notes in Computer Science, chapter Champion
Teams, pages 64–75. Springer-Verlag, Heidelberg, 2001.
Natalia Hernandez and Sridhar Mahadevan. Hierarchical memory-based reinforce-
ment learning. Fifteenth International Conference on Neural Information Pro-
cessing Systems, Nov. 27-December 2nd 2000. Denver.
John H. Holland. Hidden Order: How Adaptation Builds Complexity. Helix books.
Addison-Wesley, Reading Massachusetts, 1995.
David H. Hubel and Torsten N. Wiesel. Brain mechanisms of vision. A Scientific
American Book: the Brain, pages 84–96, 1979.
Leslie Pack Kaelbling. Hierarchical learning in stochastic domains: Preliminary
results. In Machine Learning Proceedings of the Tenth International Conference,
pages 167–173, San Mateo, CA, 1993. Morgan Kaufmann.
Leslie Pack Kaelbling, Michael L. Littman, and Andrew P. Moore. Reinforcement
learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
Craig A. Knoblock. Automatically generating abstractions for planning. Artificial
Intelligence, 68(2):243–302, 1994.
Terran Lane and Leslie Pack Kaelbling. Nearly deterministic abstractions of
Markov decision processes. In Eighteenth National Conference on Artificial In-
telligence (AAAI-02), pages 260–266, 2002.
R. Sh. Liptser, W. J. Runggaldier, and M. Taksar. Deterministic approximation for
stochastic control problems. SIAM Journal on Control and Optimization, 34(1):
161–178, 1996. Society for Industrial and Applied Mathematics.
Andrew McCallum. Reinforcement learning with selective perception and hidden
state. PhD thesis, Department of Computer Science, University of Rochester,
1995.
Amy McGovern and Richard S. Sutton. Macro-actions in reinforcement learning: An
empirical analysis. Amherst technical report 98-70, University of Massachusetts,
1998.
Amy E. McGovern. Autonomous Discovery of Temporal Abstraction from Interac-
tion with an Environment. PhD thesis, University of Massachusetts, Amherst,
Massachusetts, 2002.
Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-Cut: Dynamic discovery of
sub-goals in reinforcement learning. volume 2430 of Lecture Notes in Computer
Science. Springer, 2002.
Donald Michie. On Machine Intelligence. Ellis Horwood Limited, Chichester, second
edition, 1986.
Donald Michie and R. A. Chambers. BOXES: An experiment in adaptive control.
In E. Dale and D. Michie, editors, Machine Intelligence 2, pages 137–152. Oliver
and Boyd, Edinburgh, 1968.
Marvin Minsky. The Society of Mind. Simon and Schuster, New York, 1985.
Andrew Moore, Leemon Baird, and Leslie Pack Kaelbling. Multi-value-functions:
Efficient automatic action hierarchies for multiple goal mdps. In Proceedings of
the International Joint Conference on Artificial Intelligence, Stockholm, pages
1316–1323, 340 Pine Street, 6th Fl., San Francisco, CA 94104, 1999. Morgan
Kaufmann.
Andrew W. Moore. The parti-game algorithm for variable resolution reinforcement
learning in multidimensional state-spaces. In Jack D. Cowan, Gerald Tesauro, and
Joshua Alspector, editors, Advances in Neural Information Processing Systems,
volume 6, pages 711–718. Morgan Kaufmann Publishers, Inc., 1994.
Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforce-
ment learning with less data and less time. Machine Learning, 13:103–130, 1993.
Craig G. Nevill-Manning. Inferring Sequential Structure. PhD thesis, University
of Waikato, 1996.
Nils J. Nilsson. Teleo-reactive programs for agent control. Journal of Artificial
Intelligence Research, 1:139–158, 1994.
Andre Olave, David Wang, James Wong, Timothy Tam, Benjamin Leung, Min Sub
Kim, James Brooks, Albert Chang, Nik Von Huben, Claude Sammut, and
Bernhard Hengst. The UNSW RoboCup 2002 Legged League team. Workshop
on Adaptability in Multi-Agent Systems: The First RoboCup Australian Open
(AORC-2003), 2003.
Ronald E. Parr. Hierarchical Control and learning for Markov decision processes.
PhD thesis, University of California at Berkeley, 1998.
Ronald E. Parr. Optimality and HAMs. personal communication, March 2002.
Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, San Francisco, revised second printing edition,
1988.
Joelle Pineau, Nicholas Roy, and Sebastian Thrun. A hierarchical approach to
POMDP planning and execution. In Workshop on Hierarchy and Memory in
Reinforcement Learning, ICML-2001, 2001.
Duncan Potts and Bernhard Hengst. Concurrent discovery of task hierarchies. In
Proceedings of the AAAI Spring Symposium on Knowledge Representation and
Ontology for Autonomous Systems, pages 17–24, 2004.
Doina Precup. Temporal Abstraction in Reinforcement Learning. PhD thesis, Uni-
versity of Massachusetts, Amherst, 2000.
William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling.
Numerical Recipes in C. Cambridge University Press, 1988.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic
Programming. John Wiley & Sons, Inc., New York, NY, 1994.
Balaraman Ravindran and Andrew G. Barto. Model minimization in hierarchical
reinforcement learning. In Fifth Symposium on Abstraction, Reformulation and
Approximation (SARA 2002). Springer Verlag, 2002.
Balaraman Ravindran and Andrew G. Barto. SMDP homomorphisms: An alge-
braic approach to abstraction in semi-Markov decision processes. To appear in
the Proceedings of the Eighteenth International Joint Conference on Artificial
Intelligence (IJCAI 03), 2003.
Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pren-
tice Hall, Upper Saddle River, NJ, 1995.
Malcolm R. K. Ryan. Hierarchical Reinforcement Learning: A Hybrid Approach.
PhD thesis, Computer Science and Engineering, University of New South Wales,
2002.
Malcolm R. K. Ryan and Mark D. Reid. Using ILP to improve planning in hierarchical
reinforcement learning. Proceedings of the Tenth International Conference on
Inductive Logic Programming, 2000.
Claude A Sammut. Learning Concepts by Performing Experiments. PhD thesis,
Department of Computer Science, University of New South Wales, 1981.
Juan C. Santamaria, Richard S. Sutton, and Ashwin Ram. Experiments with rein-
forcement learning in problems with continuous state and action spaces. Adaptive
Behavior, 6(2), 1998.
Alen D. Shapiro. Structured Induction in Expert Systems. Turing Institute Press in
association with Addison-Wesley, Workingham, England, 1987.
Sidney Siegel and N. John Castellan, Jr. Nonparametric Statistics for the Behavioural
Sciences. McGraw-Hill, New York, 1988.
Herbert A. Simon. The Sciences of the Artificial. MIT Press, Cambridge, Mas-
sachusetts, 3rd edition, 1996.
Satinder Singh. Reinforcement learning with a hierarchy of abstract models. In
Proceedings of the Tenth National Conference on Artificial Intelligence, 1992.
Peter Stone. Layered Learning in Multi-Agent Systems. PhD thesis, School of Computer
Science, Carnegie Mellon University, Pittsburgh, PA, December 1998.
David L. Streiner. Maintaining standards: differences between the standard devia-
tion and standard error, and when to use each. Can J Psychiatry, 48(8):498–502,
1996.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.
MIT Press, Cambridge, Massachusetts, 1998.
Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semi-
MDPs: A framework for temporal abstraction in reinforcement learning. Artificial
Intelligence, 112(1-2):181–211, 1999a.
Richard S. Sutton, Satinder Singh, Doina Precup, and Balaraman Ravindran. Im-
proved switching among temporally abstract actions. In M. S. Kearns, S. A. Solla,
and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11.
MIT Press, 1999b.
Georgios Theocharous and Leslie Pack Kaelbling. Approximate planning in
POMDPs with macro-actions. Advances in Neural Information Processing Sys-
tems 16 (NIPS-03), 2004. to appear.
Sebastian Thrun and Anton Schwartz. Finding structure in reinforcement learning.
In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Informa-
tion Processing Systems (NIPS) 7, Cambridge, MA, 1995. MIT Press.
Paul E. Utgoff and David J. Stracuzzi. Many-layered learning. Neural Computation,
2002.
William T. B. Uther. Tree Based Hierarchical Reinforcement Learning. PhD thesis,
Computer Science, Carnegie Mellon University, 2002.
Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s
College, 1989.
Christopher J. C. H. Watkins and Peter Dayan. Technical note: Q-learning. Machine
Learning, 8:279–292, 1992.
Shimon Whiteson and Peter Stone. Concurrent layered learning. The Second Inter-
national Joint Conference on Autonomous Agents and Multiagent Systems, 2003.
to appear.
Appendix A
Mathematical Review
This appendix provides a cursory review of the algebra and statistics referred to in
this thesis.
A.1 Partitions and Equivalence Relations
HEXQ partitions the state space into equivalence classes of regions and constructs
a hierarchy of progressively coarser partitions.
A set G = {g1, . . . , gm} is called a partition of a state space S if S = g1 ∪ · · · ∪ gm and
gi ∩ gj = ∅ for all i ≠ j. Region g ∈ G is also referred to as a block or, in the current
context, an aggregate state. A state s ∈ S is referred to as a base-level state.
A partition G′ is a refinement of partition G if each block of G′ is a subset of
some block of G. Conversely G is said to be coarser than G′. A quotient partition
is a partition of the blocks of a partition. The quotient partition of G with respect
to G′ is a partition of the blocks of G′ where two blocks are identified if and only if
they are contained in the same block of G (Hartmanis and Stearns, 1966).
A binary relation B on set G is a subset of the Cartesian product G × G. If
(gi, gj) ∈ B the binary relation can be written giBgj where gi, gj ∈ G. When B is
reflexive, symmetric and transitive, that is
• giBgi,
• giBgj ⇔ gjBgi and
• giBgj, gjBgk ⇒ giBgk for all gi, gj, gk ∈ G,
then B is an equivalence relation. If B is an equivalence relation on set G then for
gi ∈ G, the equivalence class of gi is the set [gi] = {gj ∈ G|giBgj}.
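As a small illustration (the function below is invented here, not part of HEXQ), the equivalence classes of a relation given as a set of pairs can be computed directly from this definition, and together they form a partition of G:

    def equivalence_classes(G, B):
        # G: list of elements; B: set of pairs (gi, gj) such that gi B gj.
        # Assumes B is reflexive, symmetric and transitive on G.
        classes = []
        for g in G:
            cls = frozenset(h for h in G if (g, h) in B)
            if cls not in classes:
                classes.append(cls)
        return classes        # the blocks of the induced partition of G

    B = {(0, 0), (1, 1), (2, 2), (0, 1), (1, 0)}
    print(equivalence_classes([0, 1, 2], B))   # [frozenset({0, 1}), frozenset({2})]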
A.2 Directed Graphs
A directed graph G is a pair (V, E), where V is a finite set of vertices and E is a set
of directed edges. E is a binary relation on V .
An MDP can be represented as a directed graph in which the states si ∈ S, i =
1, . . . , |S| are the vertices and the state transitions are directed edges. An edge exists
from state s to s′ whenever the probability of transition in a single step from s to
s′ is greater than zero for some action a ∈ A (i.e., T^a_{ss′} > 0).
A directed graph can be decomposed into strongly connected components (SCC)
using two depth-first searches. The linear time algorithm (O(V + E)) shown in
table A.1 is used by HEXQ in Chapter 5 to find Markov regions. It is adapted from
Cormen et al. (1999). It takes as input the adjacency matrix adj[s][s′] signifying that
state s may transition to state s′ where s, s′ ∈ S. It outputs the strongly connected
component label of each state s, i.e. SCC[s] and the number of strongly connected
components in the graph, SCClabel, on return.
Table A.1: Function SCC finds strongly connected components of a directed graph representing transitions between states. An edge from s to s′, s, s′ ∈ S, is represented by adj[s][s′]. SCC[s] is the integer label of the SCC for state s.
function SCC( states S, adj[s][s′] )
1. initialise finTime ← 0
2. initialise SCClabel ← 0
3. for each state s ∈ S
4. initialise col[s] ← WHITE
5. initialise f [s] ← undefined
6. initialise SCC[s] ← undefined
7. for each state s ∈ S if (col[s] = WHITE) DFS1(s)
8. for each state s ∈ S col[s] ← WHITE
9. for each state s ∈ S in order of decreasing f [s]
10. if(col[s] = WHITE) DFS2(s)
11. increment SCClabel
12. return SCC[·], SCClabel
DFS1(s)
13. col[s] ← GRAY
14. increment finTime
15. for each state s′ ∈ S if (adj[s][s′] and col[s′] = WHITE) DFS1(s’)
16. col[s] ← BLACK
17. f [s] ← finTime
18. increment finTime
19. return
DFS2(s)
20. col[s] ← GRAY
21. for each state s′ ∈ S if (adj[s′][s] and col[s′] = WHITE) DFS2(s’)
22. col[s] ← BLACK
23. SCC[s] ← SCClabel
24. return
end SCC
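For reference, the two-pass algorithm of Table A.1 can be rendered as the following runnable Python sketch (an illustration written for this appendix, not the thesis implementation): the first depth-first search records the finishing order of states, and the second search, following edges backwards in order of decreasing finish time, labels the strongly connected components.

    def strongly_connected_components(states, adj):
        """adj[s][t] is true when state s may transition to state t in one step.
        Returns (scc_label, num_sccs) where scc_label[s] identifies the SCC of s."""
        order, visited = [], set()

        def dfs1(s):                       # first pass: record finishing order
            visited.add(s)
            for t in states:
                if adj[s][t] and t not in visited:
                    dfs1(t)
            order.append(s)

        for s in states:
            if s not in visited:
                dfs1(s)

        scc_label, label = {}, 0

        def dfs2(s):                       # second pass: follow edges backwards
            scc_label[s] = label
            for t in states:
                if adj[t][s] and t not in scc_label:
                    dfs2(t)

        for s in reversed(order):          # decreasing finish time
            if s not in scc_label:
                dfs2(s)
                label += 1
        return scc_label, label

    # Two SCCs: {0, 1} (mutually reachable) and {2}.
    adj = [[False, True, False], [True, False, True], [False, False, False]]
    print(strongly_connected_components(range(3), adj))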
A.3 Statistical Tests
In this thesis non-parametric statistical tests are used to assist in the partitioning
of a multi-dimensional state space.
The Kolmogorov-Smirnov (K-S) test is designed to test whether two independent
samples come from the same probability distribution (Siegel and N. John Castellan,
1988). The test is sensitive to the difference between the cumulative probability
distributions for the two samples. If the two samples are drawn from the same
distribution then their cumulative probability distributions can be expected to be
close to each other. A large deviation is evidence for the two samples coming from
separate distributions. The test requires that each set of samples is arranged into a
cumulative frequency distribution. For real values this implies sorting the samples by
the magnitude of their values. The maximum difference between the two cumulative
probability distributions is taken as a guide to estimate whether the two samples
come from the same distribution.
The K-S test will be used to determine whether the primitive rewards for a
transition from state s to s′ given action a are stationary. To estimate this, two
separate samples of rewards for this transition, taken at different time periods, will be
compared to see if they come from the same probability distribution. The algorithm
for the K-S test is given by McCallum (1995) who adapted it from “Numerical
Recipes in C” (Press et al., 1988). This test was successfully used by McCallum
(1995) in deciding to split states in the UTree algorithm based on the significance
of the hypothesis that states in the fringe generated a different value function from
leaf node states.
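As an illustration of the same idea using an off-the-shelf routine (SciPy's two-sample K-S test, rather than the Press et al. code used in the thesis), two windows of rewards observed for the same transition at different times can be compared, with a small p-value taken as evidence of non-stationarity; the numbers below are made up:

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    early_rewards = rng.normal(loc=-1.0, scale=0.5, size=200)  # first time period
    late_rewards = rng.normal(loc=-2.0, scale=0.5, size=200)   # later time period

    statistic, p_value = ks_2samp(early_rewards, late_rewards)
    if p_value < 0.05:
        print("reward for this transition appears non-stationary:", statistic, p_value)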
The Binomial Test is appropriate when a population consists of two classes, say
0 and 1. If the probability of sampling an instance of class 1 is p then the probability
of sampling class 0 is q = 1 − p. For a sample size of N, the probability that there
are at least k instances from class 1 is

$$\Pr[Y \geq k] = \sum_{i=k}^{N} \binom{N}{i} p^{i} q^{N-i} \qquad (A.1)$$

where

$$\binom{N}{k} = \frac{N!}{k!\,(N-k)!}. \qquad (A.2)$$
The value of p may vary from population to population. If the null hypothesis
is that p has a particular value, then it is possible to test whether it is reasonable to
believe that a sample comes from this distribution at a specific level of significance.
The two-tailed test is used to determine whether the probability of transitioning
from state s to s′ given action a, T^a_{ss′}, is stationary. The long term probability, p,
is estimated from the frequencies of transition from s to s′. A sample of transitions
from a particular time period is then tested to see if they could reasonably have
come from this distribution. If it fails, the hypothesis is rejected and the transition
is declared non-stationary.
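A sketch of this check with SciPy's binomial test (illustrative only; requires SciPy 1.7 or later, and the numbers are made up) is:

    from scipy.stats import binomtest

    p_long_run = 0.8      # long-term estimate of the transition probability T^a_ss'
    k, N = 52, 100        # transitions observed in a recent window of N trials

    result = binomtest(k, N, p_long_run, alternative='two-sided')
    if result.pvalue < 0.01:
        print("transition probability appears non-stationary")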
Appendix B
Significance Values in Graphs
B.1 Error Bars and Confidence Intervals
[Figure B.1: The 95% confidence interval in the mean performance of the stochastic taxi using a flat reinforcement learner. Axes: average reward per time step against time steps (0 to 100,000).]
For the empirical evaluations in Chapter 6 each of the experiments was averaged
over 100 runs. Error bars have been omitted in the graphs in this Chapter to avoid
clutter. As an indication of the variation in results, the 95% confidence interval is
plotted here for two of the experiments. We are interested in the confidence of the
reported mean that is assumed to be normally distributed.
Figure B.1 shows the 95% confidence interval for the mean of the stochastic taxi
results using a flat reinforcement learner from section 6.1.2. The confidence interval
of the mean was estimated by taking two standard deviations above and below the
mean of 100 sample experiments. Each experiment consisted of 100 runs.
[Figure B.2: The 95% confidence interval (2 times SE) for the stochastic taxi performance for HEXQ excluding construction. Axes: average reward per time step against time steps (0 to 10,000).]
Figure B.2 calculates the variation in the mean using the standard deviation of
the underlying data. It shows the 95% confidence interval using error bars for the
initial time steps of the stochastic taxi performance results for HEXQ excluding
construction from section 6.1.2. The confidence interval was estimated by taking
twice the standard error (Streiner, 1996) above and below the mean over 100 runs.
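For concreteness, the two interval constructions used above can be computed as follows (an illustrative sketch with made-up data):

    import numpy as np

    runs = np.random.default_rng(1).normal(loc=-0.5, scale=0.2, size=100)  # 100 runs

    mean = runs.mean()
    sd = runs.std(ddof=1)               # sample standard deviation
    se = sd / np.sqrt(len(runs))        # standard error of the mean

    interval_sd = (mean - 2 * sd, mean + 2 * sd)   # plus/minus two standard deviations
    interval_se = (mean - 2 * se, mean + 2 * se)   # ~95% confidence interval of the mean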
The analysis above shows that the variance in the expected means over 100 runs
is low enough to make the comparison between the graphs meaningful.
B.2 Taxi Learning Rate and Convergence
The temporal difference learning rates in this thesis have been kept constant, because
it was felt that efficient convergence of sub-MDPs is largely an independent issue
from the automation of hierarchy discovery. This way the task of tuning the learning
rate was avoided.
Watkins and Dayan (1992) proved that reinforcement learning will converge if
all state-action pairs are visited infinitely often and the learning rate β_i for the i-th
update for every state-action pair (s, a) meets the conditions

$$\sum_{i=1}^{\infty} \beta_i = \infty \qquad (B.1)$$

and

$$\sum_{i=1}^{\infty} \beta_i^2 < \infty. \qquad (B.2)$$
With a constant learning rate, it is instructive to understand how the learning
rate value effects the convergence for the stochastic taxi problem to help interpret
the results.
If an agent in state s takes action a, receives a stochastic reward r and moves to
the next state s′, then the Q action value function is updated with the training rule:

$$Q(s, a) \leftarrow (1 - \beta_i)\, Q(s, a) + \beta_i \left[ r + \gamma \max_{a'} Q(s', a') \right] \qquad (B.3)$$
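Written as code, training rule (B.3) with a constant learning rate is simply the following (an illustrative sketch; the state and action names are invented):

    def q_update(Q, s, a, r, s_next, actions, beta=0.25, gamma=0.9):
        # One Q-learning backup with a constant learning rate beta, as in (B.3).
        target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
        Q[(s, a)] = (1 - beta) * Q.get((s, a), 0.0) + beta * target

    Q = {}
    q_update(Q, s='taxi_at_0', a='north', r=-1.0, s_next='taxi_at_5',
             actions=['north', 'south', 'east', 'west', 'pickup', 'putdown'])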
The graph in figure B.3 below gives an appreciation of the variation in perfor-
mance for the stochastic taxi task from section 6.1.2 for various fixed settings of the
learning rate.
As the learning rate is decreased, convergence takes longer and performance
improves. This is not surprising. Large learning rates move the estimated Q values
more quickly towards their optimal values but the estimates are then nudged around
by the stochastic nature of the experience, resulting in a poorer performance as some
greedy actions become sub-optimal. Smaller learning rates move estimates more
[Figure B.3: The performance of the converged policy improves with a smaller learning rate parameter β with a flat reinforcement learner in the Taxi domain with stochastic actions. Axes: moving average reward per time step against primitive time steps (0 to 400,000); one curve per learning rate in {0.04, 0.05, 0.1, 0.2, 0.25, 0.3}.]
slowly towards the optimal values, in a less volatile fashion as each new experience
makes only small adjustments. There is hence less risk that Q values drift sufficiently
far to affect the optimal policy.
The learning rate of 0.25 was chosen for its rapid approach to the correct Q
function. The only consequence is that, when results for HEXQ and a flat learner
are compared, one reason they may not appear to converge to exactly the same
function values (as, for example, in figure 6.7) is this effect of fixing the learning
rate.
List of Figures
1.1 A maze showing three rooms interconnected via doorways. Each room
has 25 positions. The aim of the robot is to reach the goal. . . . . . . 4
1.2 The maze showing the cost to reach the goal from any location. . . . 5
1.3 The maze, decomposed into rooms, showing the number of steps to
reach each room exit on the way to the goal. . . . . . . . . . . . . . . 5
1.4 The maze, abstracted to a reduced model with only three states, one
for each room. The arrows indicate transitions for the abstract actions
that are the policies for leaving a room to the North, South, East and
West. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 An agent interacting with its environment and receiving a reward signal. 13
2.2 Episodic MDP showing the transition on termination to a hypothet-
ical absorbing state . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Ashby’s 1952 depiction of a gating mechanism to accumulate adap-
tions for recurrent situations. . . . . . . . . . . . . . . . . . . . . . . 28
3.2 The maze from Chapter 1 reproduced here for convenience. . . . . . . 31
4.1 A simple example showing three rooms interconnected via doorways.
Each room has 25 positions. The aim of the agent is to reach the goal. 47
4.2 For transition (x3, y0) →b (x2, y1) the y variable changes value. As
this is a variant transition, ((x3, y0), b) is an exit. If all states were
in the one region, then entry state (x0, y0) cannot reach exit state
(x3, y1) using non-exit actions. Two regions are therefore necessary
to meet the HEXQ partition requirements. . . . . . . . . . . . . . . . 51
4.3 In this example all states have the same y value. If all states are in the
one region, then entry state (x5, y0) cannot reach exit state (x3, y0).
Therefore, two regions are necessary to meet the HEXQ partition
requirements. The inter-block transition means that ((x3, y0), b) is an
exit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 The transitions (x0, y0) →b (x1, y0) and (x0, y1) →b (x1, y1) have dif-
ferent associated rewards and hence give rise to exits ((x0, y0), b) and
((x0, y1), b) by condition 2 of variant transitions. . . . . . . . . . . . . 52
4.5 HEXQ partitioning of the maze in figure 4.1. The state representation
effects the partitioning. In (a) the agent uses a position in room and
room sensor resulting in three regions (b). In (c) the agent uses a
coordinate like sensor, that partitions the state space into the 15
regions shown in (d). . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6 The HEXQ tree for the simple maze showing the top level semi-MDP
and the 12 sub-MDPs, 4 for each region. The numbers shown for the
sub-MDP are the position-in-room variable values. . . . . . . . . . . . 57
4.7 An example trajectory under policy π, for N = 4 steps, where sub-
MDP m invokes sub-MDP ma using abstract action a, showing the
sum of primitive rewards to the exit state sa and the primitive reward
on exit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.8 The value of state (3, 0) is composed of two HEXQ E values. . . . . . 62
4.9 A region with two exits, where the HEXQ decomposition misses a
potentially low cost exit policy from the region. . . . . . . . . . . . . 64
4.10 For the region in figure 4.9 the optimal policies for the two sub-MDPs
created by HEXQ (one for each exit) are shown in (a) and (b). The
optimal policy to use either exit, shown in (c), is not available to
HEXQ by construction. . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.11 The maze HEXQ graph with sub-MDPs represented compactly . . . . 70
4.12 The simple maze HEXQ graph with top level sub-MDP abstracted. . 72
4.13 The multi-floor maze . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.14 The HEXQ tree of sub-MDPs generated from the multi-floor maze . . 74
4.15 The HEXQ directed acyclic graph of sub-MDPs derived from the
HEXQ tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 The simple maze example introduced previously. The invariant sub-
space regions are the rooms. The lower half shows four sub-MDPs,
one for each possible room exit. The numbers are the position-in-
room variable values. . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 The Markov Equivalent region for the maze example showing the four
possible exits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 The maze example with a corner obstacle and an expensive transition
in room 0 giving rise to non-stationary transitions from the perspec-
tive of the location-in-room variable. . . . . . . . . . . . . . . . . . . 86
5.4 All actions in this example are assumed to have some probability of
transitioning to adjacent states. Part (a) illustrates two such actions
near doorways. Function Regions breaks a room iteratively into single
state MERs. The results of the first three iterations are shown as
parts (b), (c) and (d). . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5 Four SCCs joined to form two MERs . . . . . . . . . . . . . . . . . . 92
5.6 Two exits of a level 2 MER that requires 2 separate level 2 sub-MDPs
even though both exits have the same level 2 exit state, S21 . . . . . . . 94
6.1 The Taxi Domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2 The directed graph of projected state transitions for the taxi location.
Exits are not counted as edges of the graph. . . . . . . . . . . . . . . 107
6.3 The four level 1 sub-MDPs for the taxi domain, one for each hierar-
chical exit state, constructed by HEXQ. . . . . . . . . . . . . . . . . 108
6.4 State transitions for the passenger location variable at level 2 in the
hierarchy. There are 4 exits at level 2.. . . . . . . . . . . . . . . . . . 109
6.5 The top level subMDP for the taxi domain showing the abstract ac-
tions leading to the goal. . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.6 The HEXQ graph showing the hierarchical structure automatically
generated by the HEXQ algorithm. . . . . . . . . . . . . . . . . . 112
6.7 Performance of HEXQ with and without hierarchy construction against
MAXQ and a flat MDP for the stochastic taxi. . . . . . . . . . . . . . 113
6.8 Performance of the stochastic taxi with 4 variables compared to the
three variable representation. . . . . . . . . . . . . . . . . . . . . . . 117
6.9 Taxi domain with four variables, (a) x and y coordinates for the taxi
location, (b) the y variable MER for deterministic actions, (c) the
three MERs for deterministic actions when the x variable is forced to
level 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.10 Performance of the taxi with a fickle passenger compared to the orig-
inal decisive passenger. HEXQ results do not include the time-steps
required for construction of the HEXQ graph. . . . . . . . . . . . . . 120
6.11 The Taxi with Fuel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.12 The MER in the taxi with fuel problem showing the taxi location ×fuel level state space and some example transitions. . . . . . . . . . . 124
6.13 Performance of the taxi with fuel averaged over 100 runs using stochas-
tic actions and a two variable state vector. HEXQ attains optimal
performance after it is switched to hierarchical greedy execution . . . 125
6.14 HEXQ partitioning of an MDP with 25 Rooms. The numbers indicate
the values for the position-in-room variable. . . . . . . . . . . . . . . 128
6.15 The optimal value for each state after HEXQ has discovered the one
way barrier constructed in the room containing the goal. The barrier
was constructed in two separate ways (1) by using a border that
only allows transitions North and (2) a virtual barrier by imposing a
reward of -100 for transitioning South. Both barriers produced similar
value functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.16 The Tower of Hanoi puzzle with seven discs showing (a) the start
state, (b) an intermediate state and (c) the goal state. . . . . . . . . . 131
6.17 The directed graph for level 1 in the decomposition of the Tower of
Hanoi puzzle showing one MER and six exits. . . . . . . . . . . . . . 132
6.18 The directed graph for level 2 in the decomposition of the Tower of
Hanoi puzzle showing one MER and six exits. . . . . . . . . . . . . . 134
6.19 The performance comparison of a flat reinforcement learner and HEXQ
on the Tower of Hanoi puzzle with 7 discs. . . . . . . . . . . . . . . . 135
6.20 The directed graph for level 1 in the decomposition of the stochastic
Tower of Hanoi puzzle showing one MER and 12 exits. . . . . . . . . 137
6.21 A simulated stylised bipedal agent showing its various stances. . . . . 138
6.22 The four stances (left) that comprise a successful traversal of a hexag-
onal cell (right). Each of the six directions has 4 associate positions
across the cell. One set is illustrated. . . . . . . . . . . . . . . . . . . 139
6.23 The stylized soccer field illustrating the stochastic nature of the soccer
ball. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.24 The deictic representation of the location of the ball relative to the
agent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.25 The HEXQ graph for the ball kicking domain. Level 1 regions learn
to “walk”, level 2 regions learn to kick the ball and the top level learns
to kick goals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.1 For a discounted value function, the amount of discount applied after
exiting the sub-MDP depends on the number of steps required to
reach the exit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2 A simple example showing the state abstracted decomposition of a
discounted value function. (a) shows a 2-dimensional MDP with two
identical regions with one exit to the right. The deterministic actions
are move left and right, all rewards are -1. The reward on termination
is 20. The discount factor is 0.9. (b) the composed value function
for each state. (c) and (d) are the abstracted sub-MDP value and
discount functions respectively. . . . . . . . . . . . . . . . . . . . . . 154
7.3 The infinite horizon taxi task. The graph shows that HEXQ finds and
switches policies similarly to that of a flat reinforcement learner for
various values of positive reward at $. As the reward at $ increases
the taxi stops delivering passengers and instead visits the $ location
as frequently as possible. This provides confirming evidence for the
correct operation of HEXQ for an infinite horizon problem, even when
the optimum solution requires continuing in the non-exit navigation
sub-MDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.4 The soccer player showing (a) the simulated robot leg positions, (b)
400 discrete ball locations on the field, (c) the discounted value of
states in the level 2 no-exit sub-MDP when the robot is rewarded
for running around the ball and (d) a snapshot of the robot running
around the ball. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.1 The plan view of two out of ten floors connected via two banks of
elevators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.2 The simple room example with wider doorways. . . . . . . . . . . . . 171
8.3 Kaelbling’s 10×10 navigation maze. The regions represent a Voronoi
partition given the circled landmarks. . . . . . . . . . . . . . . . . . . 174
8.4 Performance of HEXQ on stochastic version of Kaelbling’s maze.
Steps per trial are collected in buckets of 10 samples over 10 runs.
The trend lines are 255 point moving averages. . . . . . . . . . . . . . 176
9.1 A simple navigation problem to leave a room. The agent uses one
step moves in 4 compass directions to reach the goal state from any
starting position. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.2 New hierarchical structure for the simple navigation problem using
multiple partitions at level 1. . . . . . . . . . . . . . . . . . . . . . . . 183
9.3 A partially observable environment with two independent tasks and
2-dimensional actions. . . . . . . . . . . . . . . . . . . . . . . . . . . 187
9.4 The expected result of an automatic decomposition of the problem in
figure 9.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
9.5 The Sony AIBO robot showing an image from its colour camera. . . . 195
B.1 The 95% confidence interval in the mean performance of the
stochastic taxi using a flat reinforcement learner. . . . . . . . . . . . 216
B.2 The 95% confidence interval (2 times SE) for the stochastic taxi per-
formance for HEXQ excluding construction. . . . . . . . . . . . . . . 217
B.3 The performance of the converged policy improves with a smaller
learning rate parameter β with a flat reinforcement learner in Taxi
domain with stochastic actions. . . . . . . . . . . . . . . . . . . . . . 219
List of Tables
2.1 Action-Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Q Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1 The HEXQ algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2 Frequency of change for the rooms example variables over 2000 ran-
dom steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3 Function Regions finds all the Markov Equivalent Regions (MERs)
at level e given a directed graph for a state space Se, Exits(Se) and
Entries(Se) sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 Procedure for evaluating the optimal value of a hierarchical state
in a HEXQ graph. The function returns the optimal value of the
hierarchical state s based on the optimal policy for each sub-MDP
and finds the best greedy action at every level up to e . . . . . . . . 96
5.5 Function Execute solves sub-MDP m associated with abstract action
a at level e. The state s is the current hierarchical state at each level
depending on context in which it is used. lse is the last projected
state at level e. The learning rate β is set to a constant value. The
E tables are originally initialised to 0. . . . . . . . . . . . . . . . . . . 99
6.1 The HEXQ algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Frequency of change for taxi domain variables over 2000 random steps.105
6.3 Storage requirements in terms of the number of table entries for the
value function for the flat MDP, MAXQ, HEXQ, HEXQ with stochas-
tic actions after eliminating no-effect actions and HEXQ with deter-
ministic actions after eliminating no-effect actions. . . . . . . . . . . . 115
6.4 The number of E action-value table entries required to represent the
decomposed value function for the soccer player compared to the the-
oretical number of Q values required for a flat learner. . . . . . . . . . 143
A.1 Function SCC finds strongly connected components of a directed
graph representing transitions between states. An edge from s to
s’, s, s′ ∈ S, is represented by adj[s][s′]. SCC[s] is the integer label
of the SCC for state s. . . . . . . . . . . . . . . . . . . . . . . . . . . 213