Title: Automatic State Construction using Decision Trees for Reinforcement Learning Agents
Manix Au
F.I.T. / C.I.T.I.
2004
Master in I.T. (Research)
The Statement of original authorship
The work contained in this thesis has not been previously submitted for a degree or
diploma at any higher education institution. To the best of my knowledge and belief, the
thesis contains no material previously published or written by another except where due
reference is made.
Signature:
Date:
Key words
Reinforcement learning
State
Action
Reward
Policy
Value based method
Policy search method
Direct gradient method
Automatic state construction
Decision tree
Partial observability
U-Tree
Kolmogorov-Smirnov two sample test
Information gain ratio test
Table of Contents

ABSTRACT 6
ACKNOWLEDGEMENT 7
INTRODUCTION TO THE THESIS 8
  REINFORCEMENT LEARNING 8
  SCOPE OF THE STUDY 8
  OVERVIEW 9
  ORIGINAL CONTRIBUTIONS 10
1 INTRODUCTION TO REINFORCEMENT LEARNING 11
  CHAPTER SUMMARY 11
  1.1 WHAT IS REINFORCEMENT LEARNING? 11
  1.2 THE ARCHITECTURE OF A RL SYSTEM 12
    1.2.1 The agent and the environment 12
    1.2.2 Overview of the agent-environment framework 13
    1.2.3 The learning structures of a typical RL problem 13
  1.3 DIFFERENT TYPES OF THE ENVIRONMENT 14
    1.3.1 The nature of the state space of the environment 14
    1.3.2 The observability of the environment 14
    1.3.3 The availability of a model of the environment 15
  1.4 THE LEARNING ASPECT 16
    1.4.1 The policy, the return and the value function 16
    1.4.2 The key to learning 18
    1.4.3 An example: A finite grid environment 21
    1.4.4 Relationship between the optimal value functions and the optimal policy 23
    1.4.5 The different techniques in solving RL problems 24
  1.5 ANCILLARY ISSUES 35
    1.5.1 Exploration-exploitation dilemma 35
    1.5.2 Temporal credit assignment 35
    1.5.3 Shaping 36
  1.6 RL APPLICATIONS 37
    1.6.1 Cellular communication system 37
    1.6.2 Others 37
  1.7 RESEARCH CHALLENGES IN RL 38
    1.7.1 Scaling up to large problems 38
    1.7.2 Partial observability of the environment 39
  1.8 AUTOMATIC STATE CONSTRUCTION 40
2 U-TREE: A RL ALGORITHM WITH AUTOMATIC STATE CONSTRUCTION FUNCTIONALITY 41
  CHAPTER SUMMARY 41
  2.1 INTRODUCTION TO THE U-TREE ALGORITHM 41
    2.1.1 Problem domain targeted by the U-Tree 42
    2.1.2 The architecture of a U-Tree 42
    2.1.3 Feature extraction 46
  2.2 CONSTRUCTION OF A U-TREE 49
    2.2.1 The U-Tree learning algorithm 49
    2.2.2 Using the tree 50
    2.2.3 Improving the tree 50
  2.3 SHORTCOMINGS 52
    2.3.1 Circular dependency 52
    2.3.2 Pre-selection of the candidate feature 52
3 PROPOSED METHOD AND RESULTS 53
  CHAPTER SUMMARY 53
  3.1 INTRODUCTION TO THE VARIANT OF THE U-TREE METHOD 53
    3.1.1 Information Gain Ratio test 53
    3.1.2 Procedure of the IGR test 54
  3.2 CONSTRUCTION OF A U-TREE BY IGR TEST 55
  3.3 EXPERIMENTAL RESULTS 56
    3.3.1 Using the Analysis of Variance statistical test 57
    3.3.2 A robot soccer problem 59
    3.3.3 A New York driving problem 71
    3.3.4 An elevator control problem 80
4 CONCLUSION 86
5 CONTRIBUTIONS 87
6 REFERENCES 88
Illustrations and diagrams

Figure 1: A finite grid problem 21
Figure 2: Value function under random policy of a finite grid problem 21
Figure 3: Value function under optimal policy of a finite grid problem 22
Figure 4: Optimal policy diagram of a finite grid problem 22
Figure 5: Example of return realization in training history 32
Figure 6: State transition dynamic under POMDP 39
Figure 7: A decision tree for playing tennis according to the sky outlook 43
Figure 8: A U-Tree for a soccer agent to get to a ball 44
Figure 9: Partition of a state including returns during leaf expansion 46
Figure 10: Probability density function of the χ² distribution 48
Figure 11: A snapshot of the robot navigation problem 59
Figure 12: Robot navigation success rate per episode by K-S test 64
Figure 13: Robot navigation success rate per episode by IGR test 65
Figure 14: Cumulative success rates, at the different stages of learning, of the K-S and IGR tests over 200 episodes in the robot navigation problem 66
Figure 15: Cumulative success rates, at the different stages of learning, of the K-S and IGR tests over 200 episodes at the internal state refinement frequency = 5 67
Figure 16: Cumulative success rates, at the different stages of learning, of the K-S and IGR tests over 200 episodes at the internal state refinement frequency = 4 68
Figure 17: Cumulative success rates, at the different stages of learning, of the K-S and IGR tests over 200 episodes at the internal state refinement frequency = 3 68
Figure 18: The coefficient of variation of the K-S statistics during the first feature extraction process in the robot navigation problem 69
Figure 19: The coefficient of variation of the IGR during the first feature extraction process in the robot navigation problem 70
Figure 20: A snapshot of the New York driving problem 71
Figure 21: Average accident counts per 2000 steps for a K-S test U-Tree 75
Figure 22: Average accident counts per 2000 steps for an IGR test U-Tree 76
Figure 23: Average honk count of the K-S and IGR tests per 2000 steps 77
Figure 24: Average scrap count of the K-S and IGR tests per 2000 steps 77
Figure 25: The K-S statistics for the first feature extraction process in the New York driving problem 78
Figure 26: The IGR for the first feature extraction process in the New York driving problem 78
Figure 27: A snapshot of the elevator control problem 80
Figure 28: Average waiting time of passengers under the K-S test U-Tree over 15 trials 84
Figure 29: Average waiting time of passengers under the IGR test U-Tree over 15 trials 85
Tables

Table 1: Effect of discount factor on return 17
Table 2: An example of a history table before update 44
Table 3: An example of a history table after update 45
Table 4: Data table for a one-way ANOVA test 57
Table 5: A U-Tree performance data table for a one-way ANOVA test 57
Table 6: Data table for a two-way ANOVA test 58
Table 7: The actions of the soccer agent 60
Table 8: The set of observations in the robot soccer problem 61
Table 9: The set of parameters used in the robot soccer problem 62
Table 10: The actions of the driving agent 72
Table 11: The set of observations in the New York driving problem 73
Table 12: The set of parameters used in the New York driving problem 74
Table 13: The actions of an individual elevator 81
Table 14: Examples of the central control action 81
Table 15: The set of observations in the elevator control problem 82
Table 16: The set of parameters used in the elevator control problem 83
Equations

Equation 1: Markov property 18
Equation 2: Markov assumption on the state transition and reward 18
Equation 3: Action value function 19
Equation 4: State value function 19
Equation 5: Optimal state value function 20
Equation 6: Optimal action value function 20
Equation 7: Policy update equation by minimizing the Bellman residual for value iteration 25
Equation 8: Policy update equation by minimizing the residual gradient for value iteration 26
Equation 9: Policy update equation by minimizing the Bellman residual for Q-learning 26
Equation 10: Policy update equation by minimizing the residual gradient for Q-learning 27
Equation 11: The gradient of Pr(h_t) 31
Equation 12: A discounted sum of rewards 35
Equation 13: A long term average reward 35
Abstract
Reinforcement Learning (RL) is a learning framework in which an agent learns a
policy from continual interaction with the environment. A policy is a mapping from
states to actions. The agent receives rewards as feedback on the actions performed.
The objective of RL is to design autonomous agents to search for the policy that
maximizes the expectation of the cumulative reward.
When the environment is partially observable, the agent cannot determine the states
with certainty. These states are called hidden in the literature. An agent that relies
exclusively on its current observations will not always find the optimal policy. For
example, a mobile robot moving down a corridor of identical doors needs to remember
the number of doors it has gone past in order to reach a specific one.
To overcome the problem of partial observability, an agent uses both current and past
(memory) observations to construct an internal state representation, which is treated
as an abstraction of the environment.
This research focuses on how features of past events are extracted with variable
granularity regarding the internal state construction. The project introduces a new
method that applies Information Theory and decision tree technique to derive a tree
structure, which represents the state and the policy. The relevance of a candidate
feature is assessed by its Information Gain Ratio ranking with respect to the
cumulative expected reward.
Experiments carried out on three different RL tasks have shown that our variant of the
U-Tree (McCallum, 1995) produces a more robust state representation and faster
learning. This better performance can be explained by the fact that the Information
Gain Ratio exhibits a lower variance in return prediction than the Kolmogorov-
Smirnov statistical test used in the original U-Tree algorithm.
Acknowledgement
My deepest gratitude goes to my supervisor, Frederic Maire. Over the past years, my
demands on his time and patience for discussion, feedback and proofreading could at
best be described as unreasonable.
Without his constant encouragement (and harassment) and his expertise, I am sure
that this thesis would never have reached completion. In several cases, important
ideas would have fallen by the wayside without Frederic’s insight to point out the
interest.
I also thank C.I.T.I. for funding my study. Without this, not only would I not have had
the freedom to pursue my own research, I would never have had the opportunity to
perform research at all.
Manix Au
04.03.2004
Introduction To The Thesis
Reinforcement Learning
Reinforcement Learning (RL) is a computational approach to automating goal-
directed learning and decision-making. It is a problem description rather than a
specific method (Sutton & Barto, 1998).
An RL agent learns how to map situations (states) to actions to achieve some given
tasks. The agent is not informed which actions to take, but instead must discover the
actions that provide long-term benefits, by trial and error.
The trial-and-error solution search characteristic gives RL significant practical value.
In many complex non-linear control problems, great difficulty arises in determining
the behaviour strategy of a controlling agent (Blythe, 1999). Moreover, it is
impractical to obtain examples of desired behaviour that are correct and representative
of all the situations in which the agent has to act. Using RL techniques, the agent
learns from its own experience to solve the problem autonomously through successive
interactions, without the need of supervision (Gullapalli, 1992).
Scope of the study
This thesis addresses an important research aspect of Reinforcement Learning. The
aim is to document the development of a new technique for the automated
construction of a goal-directed representation of system states in a decision tree.
The application of the system in Reinforcement Learning is demonstrated in three
problems where the goal of the system is to find a solution to a control problem
through experimentation. The new method is a variation on the existing U-Tree
method (McCallum, 1995) and the difference is in the selection criteria that are used
in tree construction. The proposed Information Gain Ratio test is compared and
contrasted with the Kolmogorov-Smirnov (K-S) test.
Overview
The thesis is structured in four chapters.
Chapter 1 presents an introduction to Reinforcement Learning (RL). Section 1.1
establishes the fundamental concepts in RL. In section 1.2, the components and the
learning structure of a typical RL system are explained. Section 1.3 describes the
different types of environments. Section 1.4 introduces the learning aspect of RL and
provides an overview of the different solution methods. The ancillary issues and some
successful examples of RL application are described in sections 1.5 and 1.6
respectively. Section 1.7 looks at the research challenges in RL while section 1.8
introduces automatic state construction as a strategy to solve the problems.
Chapter 2 presents the U-Tree method (McCallum, 1995), which is a RL algorithm
with automatic state construction functionality. The U-Tree method allows an agent to
construct and refine a tree structured representation of its internal state and policy by
extracting relevant features from the RL system. These features are composed of
present and past observations of the environment or events. Hence, the policy derived
is reactive to both the current and past observations. This incorporates a form of
short-term memory into the behaviour.
In section 2.1, the U-Tree algorithm is explained in terms of how decision tree
technique can be applied in internal state construction. Details of the Kolmogorov-
Smirnov Two Sample test, which is the selection criterion used for tree expansion and
internal state refinement are discussed.
Section 2.2 provides the pseudo code for the U-Tree algorithm in terms of how
learning can be facilitated by a U-Tree and how a U-Tree grows with experience.
Section 2.3 discusses the limitations and shortcomings of the U-Tree algorithm.
Chapter 3 presents a variation of the U-Tree algorithm (McCallum, 1995). A new
feature extraction criterion, the Information Gain Ratio (IGR) test, is introduced to the
U-Tree framework for internal state construction.
Section 3.1 introduces the variant of the U-Tree by beginning with a description of the
original U-Tree. This is followed by a section that discusses the technical background
of Information Theory and how it can be applied to give a new RL algorithm.
Section 3.2 provides the pseudo code for the IGR test U-Tree algorithm in terms of
the internal state refinement process.
A set of three experiments is conducted to assess the IGR test with respect to the K-S
test in growing a U-Tree. In section 3.3, the ANOVA test, which is used to compare
performance between the two U-Tree algorithms, is discussed. Then, the three
problem domains are described. These problem domains include robot navigation,
driving, and elevator control. Each problem domain subsection provides the
description of the environment, the task, the action set, the reward, the candidate
feature set, the training conditions, results and comparisons.
Chapter 4 concludes the thesis and lays a foundation for future work.
Original Contributions
o Introductory guide to Direct Gradient methods
o The application of decision tree technique in automatic state construction
o The implementation of the simulations and the tree structure
o Apply the IGR test to the U-Tree framework to produce a variant of the U-Tree
for comparison in three sets of experiments
o Performance was evaluated and the results were accepted at the CIMCA 2004
conference
1 Introduction to Reinforcement Learning
Chapter summary
Chapter 1 presents an introduction to Reinforcement Learning (RL). Section 1.1
establishes the fundamental concepts in RL. In section 1.2, the components and the
learning structure of a typical RL system are explained. Section 1.3 describes the
different types of environments. Section 1.4 introduces the learning aspect of RL and
provides an overview of the different solution methods. The ancillary issues and some
successful examples of RL application are described in sections 1.5 and 1.6
respectively. Section 1.7 looks at the research challenges in RL while section 1.8
introduces automatic state construction as a strategy to solve the problems.
1.1 What is Reinforcement Learning?
Reinforcement Learning (RL) is a computational approach to automating goal-
directed learning and decision-making. It is a problem description rather than a
specific method (Sutton & Barto, 1998).
An RL agent learns how to map situations (states) to actions to achieve some given
tasks. The agent is not informed which actions to take, but instead must discover the
actions that provide long-term benefits, by trial and error. For example, a chess
playing agent must be able to determine which moves have been critical to the
outcome and then alter its strategy accordingly.
The trial-and-error solution search characteristic gives RL significant practical value.
In many complex non-linear control problems, great difficulty arises in determining
the behaviour strategy of a controlling agent (Blythe, 1999). Moreover, it is
impractical to obtain examples of desired behaviour that are correct and representative
of all the situations in which the agent has to act. Using RL techniques, the agent
learns from its own experience to solve the problem autonomously through successive
interactions, without the need of supervision (Gullapalli, 1992).
1.2 The architecture of a RL system
A RL system is composed of an agent and its environment.
1.2.1 The agent and the environment
In a RL system, a learning agent is embedded in an environment. The environment
consists of everything outside of the agent. It is what the agent can perceive and act
on. In the RL framework, the agent-environment interaction is abstracted in a triplet
of (state, action, reward).
• A state represents a particular snapshot of the environment or a situation. It
becomes the basis for the agent to select an action to carry out.
• An action represents a decision made by the agent depending on the state of
the environment.
• A reward is a scalar feedback from the environment. It quantifies how good a
performed action is.
The following table illustrates some exemplary control problems with the states, the
actions and the reward defined.
Task: A chess game
  State: All the possible combinations of the chess board
  Action: The legal moves
  Reward: +1 when the game is won; -1 otherwise

Task: Cart pole balancing
  State: The deviation angle and angular velocity of the pole, the velocity of the cart and the distance of the cart from the edge
  Action: The magnitude and direction of the force applied to the cart
  Reward: -1 when the pole falls or when the cart goes over the edge, otherwise 0

Task: Motion control of a robotic arm in a repetitive pick-and-place task
  State: The object, the joint angles and velocities
  Action: The amount of voltage applied to the motors in each joint
  Reward: +1 if each object is successfully placed, otherwise 0

Task: Object avoidance behaviour of a mobile robot
  State: The positions of the objects in the panoramic view of the robot
  Action: Travelling forward, left, right and rear
  Reward: -1 if the agent bumps into an object, otherwise 0

Table 1: Examples of RL tasks
1.2.2 Overview of the agent-environment framework
Diagrammatically, the agent-environment interaction can be represented as follows.
Figure 1: Agent-environment interaction framework
The agent and the environment interact continually in a sequence of time steps. At an
arbitrary time step t, the agent observes the environment's state s_t ∈ S, where S
represents the set of states of the environment. Depending on the state, the agent
selects an action a_t ∈ A(s_t), where A(s_t) represents the set of actions available in
state s_t. The agent then receives a numeric reward r_t as a consequence of the action
a_t performed, and the environment changes to a new state s_{t+1}.
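This interaction loop can be sketched in a few lines of Python. The coin-guessing environment and the random agent below are invented purely for illustration (they are not one of the thesis problems); any environment exposing reset() and step(), and any agent exposing act() and learn(), would fit the same frame.

```python
import random

class CoinEnv:
    """Toy environment: guess a coin flip; reward +1 if correct, -1 otherwise."""
    def reset(self):
        self.coin = random.choice(["heads", "tails"])
        return "start"                      # a single, uninformative state
    def step(self, action):
        reward = 1 if action == self.coin else -1
        next_state = self.reset()           # episodes of length one; restart
        return next_state, reward

class RandomAgent:
    def act(self, state):
        return random.choice(["heads", "tails"])
    def learn(self, state, action, reward, next_state):
        pass                                # a learning agent would update here

env, agent = CoinEnv(), RandomAgent()
state = env.reset()
total = 0
for t in range(1000):                       # the sequence of time steps
    action = agent.act(state)               # agent selects a_t given s_t
    next_state, reward = env.step(action)   # environment returns r_t and s_{t+1}
    agent.learn(state, action, reward, next_state)
    total += reward
    state = next_state
```

A learning agent would replace the empty learn() with an update of its policy from the (s_t, a_t, r_t, s_{t+1}) experience tuple.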
1.2.3 The learning structures of a typical RL problem
The key learning structures of a RL problem are the policy, the reinforcement function
and the value function.
• A policy π_t is part of the agent. It is defined as a mapping from the states to
the actions.
• A reinforcement function is designed to provide a reward to the agent, on the
action performed in a particular state at each time step. The aim is to implicitly
define the desired behaviour of the agent in terms of rewards so that a good
policy can be learnt through trial and error.
• Value functions provide a measure of the goodness of the states or the
goodness of the state and action pairs. The mathematical definition of the
value function forms the basis for the development of efficient value-based RL
algorithms.
1.3 Different types of the environment
Different types of environments require more domain specific approaches. The
environments can be categorized according to the nature of the state space, the
observability and the availability of a model of the environment.
1.3.1 The nature of the state space of the environment
In simple RL systems, the state space of the environment is discrete. For example, the
state space in a tic-tac-toe game is represented by all the possible combinations of
circles and crosses on the board. In practice, many RL systems have a continuous state
space. In the pole balancing problem, the state space is formed by the cross product
of some features of the environment. These features are the angle of the pole, the
angular velocity of the pole, the position of the cart, and the velocity and/or the
acceleration of the cart.
An environment contains many features; however, some are irrelevant. For example,
the position, the colour and the size of the circles and crosses are features of the
environment in a tic-tac-toe game. A logical state space can be defined using the
combination of circles and crosses on the board, regardless of their colours or sizes.
Only the positions of the circles and the crosses are of concern for state definition.
1.3.2 The observability of the environment
When an environment is termed fully observable, all the features are well-perceived
by the agent. The agent can use its perception to comprehend the environment’s state.
For example, a chess-playing agent needs to know the positions of all the chess pieces
to define the state so as to make a move.
The environment is said to be partially observable when some features of the
environment are not perceivable. Partial observability can be caused by various
reasons, such as limitations of the sensors, noise and occlusions. Under these
circumstances, an agent is unable to disambiguate amongst the different states. The
indistinguishable states are called hidden states. For instance, a driver agent learns to
navigate through traffic with the state space defined by the positions of the vehicles in
the agent's current view. To pass a vehicle that is close in front, the agent needs to
check the blind spot for clearance to overtake. In this situation, an accident could
happen because the agent cannot perceive both views at the same time.
1.3.3 The availability of a model of the environment
When a model of the environment is provided, the transition dynamics of the
environment is known to the agent. During the agent-environment interaction, the
agent performs an action, which causes the environment to transit into the next state.
Deterministic state transition implies that the state transition probability from a state
to another, given an action, is unity. Stochastic state transition means that, given an
action, a state may transition to a number of possible states, each with an associated
probability. When the model of the environment is given, the state transition
probabilities from the current state to all the possible future states are known.
1.4 The learning aspect
RL methods originated from two disciplines, Dynamic Programming (DP) and
supervised learning. DP is a field of Mathematics, which has traditionally been used
to solve problems of optimization and control. Nevertheless, traditional DP is limited
in the size and the complexity of the problem it can address. Supervised learning is a
general method for training a parameterized function approximator to represent
learning functions. It requires sample pairs of input and output from the function that
is to be learnt. Unfortunately, sample outputs cannot be easily obtained in practice.
The following sections explain the classic RL methods, which apply the control
algorithms developed in DP to learn the value of actions and hence to facilitate control.
Section 1.4.5.2 introduces a modern RL approach, the Policy Search method, which
learns control directly through some gradient function.
1.4.1 The policy, the return and the value function
A policy π_t represents a probability mapping from states to actions:
π_t(s, a) = Pr(a_t = a | s_t = s). A RL agent learns to select a good action depending on
the state, and RL methods specify how the agent changes its policy as a result of its
experience.
The reward r_t represents the immediate feedback for the action a_t chosen given the
state s_t. To choose an action that is beneficial in the long term, the agent tries to
maximize a quantity called the return R_t.
The return R_t is the sum of discounted rewards from the current state proceeding to
the terminal state. Formally, the return is defined by

    R_t = Σ_{n=t}^{T} γ^{n-t} r_n

where γ, 0 ≤ γ ≤ 1, is a discount factor and T, T ≥ t, is the time when the terminal
state is reached.
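As a concrete check of this definition, the short Python sketch below computes R_t for t = 0 over an invented three-step reward sequence:

```python
def discounted_return(rewards, gamma):
    """R_t = sum over n from t to T of gamma^(n-t) * r_n, here for t = 0."""
    return sum(gamma**n * r for n, r in enumerate(rewards))

rewards = [0, 0, 1]                       # reward only on the final step
print(discounted_return(rewards, 1.0))    # → 1.0 (undiscounted sum)
print(discounted_return(rewards, 0.5))    # → 0.25 (final reward discounted twice)
```

With γ = 0.5 the single terminal reward is multiplied by γ² = 0.25, illustrating how smaller discount factors shrink the contribution of distant rewards.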
The discount factor γ determines the extent to which future rewards are discounted.
The table below illustrates the effect of γ on R_t.

Discount factor    Return                            Proportion of r_t in R_t
γ = 0              R_t = r_t                         1
γ = 1              R_t = r_t + r_{t+1} + … + r_T     1/T

Table 1: Effect of discount factor on return

When the discount factor γ → 0, the return R_t attends relatively more to rewards in
the near future. When the discount factor γ → 1, the return R_t also takes into account
rewards in the more distant future.
To obtain a good policy, an agent must select actions that maximise the return such
that the agent can move into better states towards the goal. A value function is a
goodness measure of a state and there are two forms of value functions being
considered in RL.
Under policy π, a state value function V^π estimates the goodness for an agent to be
in a given state; an action value function Q^π estimates the goodness for an agent to
perform a given action in a given state.
The estimation of value functions is defined with respect to a particular policy. This
forms the foundation of the classic Value Based RL algorithms where the estimation
of value functions affects the learning of a policy.
Formally, under a policy π, the value of a state s, denoted V^π(s), is defined as the
expected return when starting in s and following π thereafter:
V^π(s) = E_π[R_t | s_t = s]. Similarly, the value of performing action a in state s under
policy π, denoted Q^π(s, a), is defined as the expected return starting from s, taking
action a and thereafter following π: Q^π(s, a) = E_π[R_t | s_t = s, a_t = a].
Value functions are estimated from experience. They can be represented in table
format for simple, discrete state spaces, or with parameterized function
approximators for continuous state spaces.
1.4.2 The key to learning
The Bellman equations (Bellman, 1957) are mathematical equations that form the key
to Value Based RL. By expressing the value functions in Bellman equation form, the
value of the states can be iteratively updated and learnt.
The theoretical development of the Bellman equations is made under the model
assumption of a Markov process, which is introduced in the following section, the
Markov property.
1.4.2.1 The Markov property
The Markov Property is a mathematical assumption that the dynamics of a system is
independent of any observations or events beyond the immediate past.
1.4.2.1.1 Markov process
A Markov process is a stochastic process in which the future distribution of a variable
depends only on the current value of the variable. A Markov process that describes
the state transition of an environment is mathematically expressed as follows.
    Pr(s_{t+1} = s' | s_t, s_{t-1}, …, s_0) = Pr(s_{t+1} = s' | s_t)
Equation 1: Markov property
The equation states that the state transition depends only on the current state. This
memory-less property of the state transition is called the Markov property.
1.4.2.1.2 Markov Decision Process
A dynamic system that satisfies the Markov property is called a Markov Decision
Process (MDP).
Definition: In the RL framework, a MDP is composed of (Puterman, 1994): a state
space S; an action space A; a reinforcement function r : S × A → ℝ; a policy
π : S → A; and a state transition function P : S × A → P(S).
It is assumed that the Markov property holds for the state transition and the reward
that they depend only on the current state and the action at each time step.
    Pr(s_{t+1} = s', r_{t+1} = r | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = Pr(s_{t+1} = s', r_{t+1} = r | s_t, a_t)
Equation 2: Markov assumption on the state transition and reward
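To make the MDP components concrete, the sketch below writes them down as plain Python data. The two-state chain is invented for illustration (it is not one of the thesis problems): P[s][a] lists (next state, probability) pairs, and R[(s, a, s')] gives the expected next reward, with unlisted transitions rewarding zero.

```python
S = ["left", "right"]                      # state space
A = ["stay", "move"]                       # action space

# State transition function: P[s][a] = [(s', probability), ...]
P = {
    "left":  {"stay": [("left", 1.0)],
              "move": [("right", 0.9), ("left", 0.1)]},   # stochastic move
    "right": {"stay": [("right", 1.0)],
              "move": [("left", 0.9), ("right", 0.1)]},
}

# Reinforcement function: expected reward for a (s, a, s') transition.
R = {("left", "move", "right"): 1.0}       # all unlisted transitions give 0

# Sanity check: transition probabilities out of each (s, a) must sum to one.
for s in S:
    for a in A:
        assert abs(sum(p for _, p in P[s][a]) - 1.0) < 1e-9
```

Because the Markov assumption holds, this finite table is all an agent with a model needs: the distribution of the next state and reward depends only on the current state and action.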
Given any state s ∈ S and action a ∈ A, the transition probability into the next
possible state s' is denoted by P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a). The expected
value of the next reward is denoted by R^a_{ss'} = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'].
1.4.2.2 The Bellman equations
The Bellman equation provides the basis for the agent to approximate and learn the value
functions for a policy (Bellman, 1957). The one-step dynamics of the environment
give rise to a recursive relationship between the value of the current state and the values of
the next possible states.
$Q^\pi(s, a) = E_\pi\left[ R_t \mid s_t = s, a_t = a \right] = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
Equation 3: Action value function
Similarly, the state value function states that the value of a state must equal the
expected immediate reward plus the discounted value of the next state, summed over
all possible successor states and over all actions that the policy may select to cause
those transitions.
$V^\pi(s) = E_\pi\left[ R_t \mid s_t = s \right] = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right] = \sum_a \pi(s, a)\, Q^\pi(s, a)$
Equation 4: State value function
1.4.2.3 The Bellman optimality equations
The state value function $V^\pi(s)$ represents the value of $s$ under $\pi$. The state value
function defined under an optimal policy $\pi^*$ is termed the optimal state value function,
$V^*(s) = V^{\pi^*}(s)$.
When $V^*(s)$ is achieved, the value functions in Bellman equation form can be
expressed in a special form, known as the Bellman optimality equation.
In Bellman optimality equation form, $V^*(s)$ is defined as the expected return for the
best action from that state.
$V^*(s) = \max_a Q^{\pi^*}(s, a) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$
Equation 5: Optimal state value function
Similarly, the optimal action value function states that the optimal value of a state-action
pair equals the sum of the expected immediate return from that state-action pair and the
discounted return from the best action in the next state.
$\begin{aligned}
Q^*(s, a) &= E\left[ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \right] \\
&= E\left[ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \right] \\
&= \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]
\end{aligned}$
Equation 6: Optimal action value function
The optimal value function $V^*$ is the fixed point of the mapping $V \to f(V)$ where
$f(V)(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s') \right]$. This fixed point $V^*$ is the limit of the sequence
$V, f(V), f^2(V), \ldots, f^n(V), \ldots$
So, the Bellman optimality equation consists of a system of non-linear equations, with
one equation for each state and independent of the policy $\pi$, and the
fixed point $V^*$ can be solved iteratively by repeatedly substituting the current approximation
of $V^\pi$ into the Bellman optimality equation.
1.4.3 An example: A finite grid environment
A simple RL problem under a finite grid environment is used to demonstrate how an
optimal policy can be obtained with the use of a value function.
The environment is represented by a four-by-four grid, with each square a state. The
action set is composed of four directional motor actions: moving up, down, left or
right by one grid square. The reinforcement is -1 on every transition. The terminal
states are located in the upper left and the lower right squares, labelled ‘Goal’. The
agent starts at a random square on the grid and tries to reach either of the two goal states. It
follows a random policy to choose one of the four possible actions.
[A four-by-four grid with the upper-left and lower-right corner squares marked ‘Goal’]
Figure 1: A finite grid problem
The value function learnt under a random policy is shown in the following grid, with
the numbers in the squares indicating the expected values of the states. For instance,
when starting in the lower left corner and following a random policy, the agent takes
22 transitions on average to reach a goal state.
0 -14 -20 -22
-14 -18 -20 -22
-20 -20 -18 -14
-22 -20 -14 0
Figure 2: Value function under random policy of a finite grid problem
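The values in Figure 2 can be reproduced with a short iterative policy evaluation sketch (an illustrative reconstruction, not code from the thesis; it assumes reward -1 per transition, $\gamma = 1$, an equiprobable random policy, and that off-grid moves leave the agent in place):

```python
import numpy as np

# Iterative policy evaluation for the 4x4 grid: reward -1 per transition,
# gamma = 1, equiprobable random policy, terminal squares at two corners.
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(i, j, di, dj):
    ni, nj = i + di, j + dj
    # moving off the grid leaves the state unchanged
    return (ni, nj) if 0 <= ni < N and 0 <= nj < N else (i, j)

V = np.zeros((N, N))
delta = 1.0
while delta > 1e-6:
    delta = 0.0
    for i in range(N):
        for j in range(N):
            if (i, j) in TERMINALS:
                continue
            # Bellman backup under the random policy: average over actions
            new_v = sum(-1.0 + V[step(i, j, di, dj)] for di, dj in MOVES) / 4.0
            delta = max(delta, abs(new_v - V[i, j]))
            V[i, j] = new_v

print(np.round(V))  # the square beside a goal evaluates to about -14
```

The expected number of transitions to a goal under the random policy is the negation of each value, e.g. about 22 from the corner farthest from both goals.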
Since a random policy is suboptimal, the agent uses a new policy. It follows a new
action selection heuristic to move into a neighbouring state with a higher state value.
As a result, the agent always moves to a better state that is closer to the terminal state.
This new policy is optimal with respect to reaching a terminal state quickly. The
value function derived under this optimal policy is shown below, followed by the
optimal policy diagram.
0 -1 -2 -3
-1 -2 -3 -2
-2 -3 -2 -1
-3 -2 -1 0
Figure 3: Value function under optimal policy of a finite grid problem
Figure 4: Optimal policy diagram of a finite grid problem
This example illustrates the concept of classic Value Based methods, which use value
functions to derive an optimal policy. Many algorithms in RL are devised to find the
value functions efficiently.
1.4.4 Relationship between the optimal value functions and the optimal policy
Value functions can be used to define a partial ordering over policies. This
partial ordering of policies is explained as follows.
Definition: A policy $\pi$ is better than or equal to a policy $\pi'$ if and only if
$V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$.
The optimal value function $V^*(s) = \max_\pi V^\pi(s)$ is the unique fixed point of the Bellman
optimality equation. The optimal value function $V^*$ can therefore be attained by a set
of optimal policies, amongst which there exists at least one deterministic optimal policy.
When the optimal value function *V is solved, a deterministic optimal policy *π
chooses only the action at which the maximum is attained in the Bellman optimality
equation for each state. In other words, a deterministic optimal policy *π assigns non-
zero probability only to these actions.
For finite MDPs, the deterministic optimal policy $\pi^*$ is said to be “greedy” with
respect to the optimal state value function. This is because the actions that appear best
after a one-step search are treated as the optimal actions once the optimal state value
function is defined. The term “greedy”, as in “greedy policy”, is used in computer
science to describe any search or decision procedure that selects alternatives based
only on local or immediate considerations. A greedy search algorithm does not involve
any backtracking; it ignores the possibility that a selection may prevent future
access to even better alternatives.
In the case of the optimal action value function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, the one-step
search is not required. The agent can simply find any action that maximizes $Q^*(s, a)$.
This is the benefit of representing the value function over state-action pairs, instead of
states, when the dynamics of the environment are not provided.
1.4.5 The different techniques in solving RL problems
There are two major learning approaches in RL problems. They are the classic Value
Based methods and the modern Policy Search methods. In RL problems, the aim is to
learn a good policy. Policy Search methods and Value Based methods differ in the
mechanism of how a good policy can be learnt (Aberdeen, 2002).
In Value Based methods, the value functions are learnt in order to obtain a good
policy (Sutton & Barto, 1998, Littman & Sun, 2000). In contrast, Policy Search
methods do not require any value functions to update the policy. The policy is learnt
directly through a parameterized function, which is changed according to some
measurements (Aberdeen, Baxter, 2002, Hausen, 1998, Shelton, 2001). An example
of these measurements is the gradient of the return (Baird & Moore, 1999).
1.4.5.1 Value based methods (learning the value of actions)
In value based methods, the value functions are iteratively updated so as to satisfy the
Bellman equation. The policy is output indirectly via the learning of the value
functions.
1.4.5.1.1 Value iteration (learning with a model)
For a state $s$, let $V(s)$ be the approximation of the true but unknown optimal value
function $V^*(s)$. In general, the values $V(s)$ are randomly initialized.
Let $s'$ denote a possible successor state of $s$. Iteratively looping over all $s$, the approximation $V(s)$
is updated towards $V^*(s)$ by satisfying the Bellman equation

$V_{k+1}(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$

where $V_k(s)$ represents the $k$th approximation of $V(s)$ to $V^*(s)$.
Lookup tables
For lookup table value functions, the optimal value function can be solved by
updating each state value according to the Bellman optimality equation,

$V(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s') \right]$

The sweep update of state values reduces the Bellman residual

$\Delta_s = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s') \right] - V(s)$

until the residual is smaller than a positive constant. Upon convergence of the state
value function, an optimal policy can be defined as

$\pi^*(s) = \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$

Value based methods with lookup tables can be severely affected by the size and nature
of the state space, because of the sweep update over all states.
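As an illustrative sketch (not code from the thesis), the same sweep update with the max over actions recovers the optimal values of the grid example from section 1.4.3:

```python
import numpy as np

# Value iteration on the 4x4 grid of section 1.4.3: reward -1 per
# transition, gamma = 1, terminal squares at two opposite corners.
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(i, j, di, dj):
    ni, nj = i + di, j + dj
    return (ni, nj) if 0 <= ni < N and 0 <= nj < N else (i, j)

V = np.zeros((N, N))
delta = 1.0
while delta > 1e-9:
    delta = 0.0
    for i in range(N):
        for j in range(N):
            if (i, j) in TERMINALS:
                continue
            # Bellman optimality backup: V(s) <- max_a [r + V(s')]
            best = max(-1.0 + V[step(i, j, di, dj)] for di, dj in MOVES)
            delta = max(delta, abs(best - V[i, j]))
            V[i, j] = best

print(V[3, 0])  # -3.0: three steps from the far corner to the nearest goal
```

Each state's optimal value is simply the negated shortest-path distance to the nearest goal, matching Figure 3.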
Function approximators by minimizing the Bellman residual
In many real world problems, the state space can be very large or continuous. For
this reason, a continuous model, such as a neural network or some other function
approximator, is needed for the approximation $V(s, w)$ of the optimal value function
$V^*(s)$, where $w$ is the network parameter vector.
Given a learning rate $\alpha$, the parameter $w$ is updated to minimize the Bellman
residual

$\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s', w_t) \right] - V(s, w_t)$

according to the following update equation.

$\Delta w_t = \alpha \left[ \max_a \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma V(s', w_t) \right) - V(s, w_t) \right] \frac{\partial V(s, w_t)}{\partial w_t}$
Equation 7: Policy update equation by minimizing the Bellman residual for value iteration
Using function approximators for value based methods can result in convergence
problems in $\Delta w$. The desired value of the state, $\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s', w_t) \right]$, is
expressed as a function of the parameter $w = w_t$ at time $t$. When the update $w \to w'$
is performed, the target value changes as a new function of a different parameter
$w_{t+1} = w'$. As a result, the value of the Bellman residual can increase, and this causes
the value of the parameter $w$ to oscillate or to grow to infinity.
Using function approximators by residual gradient
To overcome the convergence problem of using function approximators with the Bellman
residual, gradient descent is performed on the mean squared Bellman
residual. This method is termed the residual gradient algorithm, with residual term
$r_t + \gamma V(s', w_t) - V(s, w_t)$, and it guarantees convergence of the parameter $w$ to a
local minimum. The network parameter update equation is given as follows.
$\Delta w_t = -\alpha \left[ r_t + \gamma V(s', w_t) - V(s, w_t) \right] \left[ \gamma \frac{\partial V(s', w_t)}{\partial w_t} - \frac{\partial V(s, w_t)}{\partial w_t} \right]$
Equation 8: Policy update equation by minimizing the residual gradient for value iteration
1.4.5.1.2 Q-Learning (learning without a model)
Q-learning (Watkins, 1992) was one of the most important breakthroughs in RL
because Q-learning extends the traditional DP value iteration methods for RL.
The value iteration method requires finding the action that returns the
maximum expected value. This involves computing the sum of the reinforcement and
the integral over all possible successor states for the given action, which is
computationally expensive in practice.
Q-learning solves this issue by simply taking the maximum over the value set of
the successor states. Rather than learning the state value for each state as in value
iteration, Q-learning learns the Q-value (or action value) for each state-action pair.
Therefore, there is a Q-value associated with each action in each state, and the
function representing the Q-values is called a Q-function (or action value function).
During a Q-value update, Q-learning does not require the expected values of the
successor states to be calculated. The value of a state is defined to be the maximum
Q-value in the given state.
Lookup table
When the Q-function is represented in lookup table form, the Q-values are updated
according to

$Q(s, a) \leftarrow r_t + \gamma \max_{a'} Q(s', a')$

The updates of the Q-values minimize the Bellman residual

$\Delta = r_t + \gamma \max_{a'} Q(s', a') - Q(s, a)$
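A minimal lookup-table sketch of this update (illustrative only; the five-state corridor, its rewards and the episode count are assumptions, not a task from the thesis):

```python
import random

# Hypothetical 5-state corridor: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 ends the episode with reward +1; all other steps give 0.
N_STATES, GOAL, GAMMA = 5, 4, 0.9

def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(200):                       # episodes, purely random exploration
    s, done = 0, False
    while not done:
        a = random.randrange(2)
        s2, r, done = env_step(s, a)
        Q[s][a] = r + GAMMA * max(Q[s2])   # lookup-table Q-learning backup
        s = s2

greedy = [max(range(2), key=Q[s].__getitem__) for s in range(GOAL)]
print(greedy)  # [1, 1, 1, 1]: the greedy policy moves right in every state
```

Note that the greedy policy is read off the table afterwards; no model of the transition function was needed during learning.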
Using function approximators
When the state and action product space is large, a neural network can be used instead
to represent the Q-function. The following equations are the parameter update equations
for minimizing the Bellman residual and for the residual gradient.
Bellman residual:
$\Delta w_t = \alpha \left[ r_t + \gamma \max_{a'} Q(s', a', w_t) - Q(s, a, w_t) \right] \frac{\partial Q(s, a, w_t)}{\partial w_t}$
Equation 9: Policy update equation by minimizing the Bellman residual for Q-learning
Residual gradient:
$\Delta w_t = -\alpha \left[ r_t + \gamma \max_{a'} Q(s', a', w_t) - Q(s, a, w_t) \right] \left[ \gamma \frac{\partial \max_{a'} Q(s', a', w_t)}{\partial w_t} - \frac{\partial Q(s, a, w_t)}{\partial w_t} \right]$
Equation 10: Policy update equation by minimizing the residual gradient for Q-learning
The properties of these function approximators are analogous to those in the value
iteration methods. The form that minimizes the Bellman residual can have
convergence problems, whilst the residual gradient form guarantees convergence to a
stable Q-function.
1.4.5.2 Policy search methods (learning without a model and without
the value functions)
Under the constraints of partial observability, learning with Value Based methods is very
complicated (Murphy, 2000, Lanzi, 2000, Lin, Mitchell, 1992). Policy Search
methods have become more practical than Value Based methods in POMDPs
(Aberdeen, Baxter, 2002, Baird & Moore, 1999, Cassandra, 1999, Murphy, 2000)
because they avoid the difficulty and complexity involved in learning value functions under
partial observability.
The nature of Policy Search methods is model-free. In policy search methods, the
policy space is searched by direct optimization methods. The policy is explicitly
represented by its own function approximator (Cassandra, Kaelbling & Littman, 1994,
Peshkin, 2000). This function approximator does not learn any value function. During
learning, policy improvement is achieved by direct gradient ascent (or descent) on
some error functions or by paired comparisons amongst policies. In other words, it is
easier to learn how to act in Policy Search methods than to learn the value of actions
as in Value Based methods.
Value Based methods can suffer from convergence problems (Astrom, 1965), where
the value function either diverges or oscillates between the possible best and worst
outcomes under greedy exploration. Policy Search methods are guaranteed to converge to a
local optimum. However, the learning process can be slow, and convergence to a global
optimum in the policy search space is not guaranteed.
1.4.5.2.1 Direct gradient methods
In direct gradient approach, the policy is represented by a parameterized function with
a weight vector w for action selection. The weight vector w is adjusted to improve
the performance of the agent. The adjustment is made in the direction of the gradient
of the expected return.
Preparation for deriving the gradient function of the return
Let $h$ represent a training history, such that $h_t$ is composed of the state, action and
reward triplets from the start time 0 up to time $t$.

$h_t = \left\{ (s_0, a_0, r_0), (s_1, a_1, r_1), \ldots, (s_t, a_t, r_t) \right\}$
The likelihood of $h_0$ is the probability of the events happening at the start.

$\Pr(h_0) = \Pr(s_0) \Pr(a_0 \mid s_0) \Pr(r_0 \mid s_0)$

The likelihood of $h_1$ is the probability of the events happening from the start up to
time $t = 1$.

$\Pr(h_1) = \Pr(h_0) \Pr(s_1, a_1, r_1 \mid h_0) = \Pr(s_0) \Pr(a_0 \mid s_0) \Pr(r_0 \mid s_0) \cdot \Pr(s_1 \mid s_0, a_0) \Pr(a_1 \mid s_1) \Pr(r_1 \mid s_1)$
Similarly, the likelihood of $h_2$ is the probability of the events happening from the start
up to time $t = 2$.

$\Pr(h_2) = \Pr(h_1) \Pr(s_2, a_2, r_2 \mid h_1) = \Pr(s_0) \Pr(a_0 \mid s_0) \Pr(r_0 \mid s_0) \cdot \Pr(s_1 \mid s_0, a_0) \Pr(a_1 \mid s_1) \Pr(r_1 \mid s_1) \cdot \Pr(s_2 \mid s_1, a_1) \Pr(a_2 \mid s_2) \Pr(r_2 \mid s_2)$
Hence, the likelihood of a particular $h_t$ is the probability of the events happening
from the start up to time $t$.

$\Pr(h_t) = \Pr(h_{t-1}) \Pr(s_t, a_t, r_t \mid h_{t-1}) = \Pr(s_0) \Pr(a_0 \mid s_0) \Pr(r_0 \mid s_0) \prod_{k=0}^{t-1} \Pr(s_{k+1} \mid s_k, a_k) \Pr(a_{k+1} \mid s_{k+1}) \Pr(r_{k+1} \mid s_{k+1})$
Since the action selection function $\Pr(a_t \mid s_t)$ is a non-zero smooth function of the
weight vector $w$, let $\Pr(a_t \mid s_t) = f_t(w)$ to simplify the expression for $\Pr(h_t)$.

$\Pr(h_t) = \Pr(s_0) f_0(w) \Pr(r_0 \mid s_0) \prod_{k=0}^{t-1} \Pr(s_{k+1} \mid s_k, a_k) f_{k+1}(w) \Pr(r_{k+1} \mid s_{k+1})$
Note that the terms $\Pr(r_0 \mid s_0), \ldots, \Pr(r_t \mid s_t)$, $\Pr(s_1 \mid s_0, a_0), \ldots, \Pr(s_t \mid s_{t-1}, a_{t-1})$ and
$\Pr(s_0)$ are not functions of $w$. The state transition function and the reinforcement
function are part of the environment and are invariant to the policy parameter.
The likelihood expression $\Pr(h_t)$ is then differentiated with respect to $w$ to provide
the gradient of the likelihood, $\partial \Pr(h_t) / \partial w$.
$\begin{aligned}
\frac{\partial}{\partial w} \Pr(h_t) &= \frac{\partial}{\partial w} \left[ c \prod_{k=0}^{t} f_k(w) \right], \quad \text{where } c = \Pr(s_0) \Pr(r_0 \mid s_0) \prod_{k=0}^{t-1} \Pr(s_{k+1} \mid s_k, a_k) \Pr(r_{k+1} \mid s_{k+1}) \\
&= c \sum_{n=0}^{t} \left[ \frac{\partial f_n(w)}{\partial w} \prod_{k=0,\, k \ne n}^{t} f_k(w) \right] \\
&= c \left[ \prod_{k=0}^{t} f_k(w) \right] \sum_{n=0}^{t} \frac{\partial f_n(w) / \partial w}{f_n(w)} \\
&= \Pr(h_t) \sum_{k=0}^{t} \frac{\partial}{\partial w} \ln f_k(w) \\
&= \Pr(h_t) \sum_{k=0}^{t} \frac{\partial}{\partial w} \ln \Pr(a_k \mid s_k)
\end{aligned}$

Equation 11: The gradient of $\Pr(h_t)$
This likelihood gradient $\partial \Pr(h_t) / \partial w$ will be used in the calculation of the gradient
function of the return.
The return
Let $\mathrm{H}$ represent a set of training histories $h$. Let $R(h)$ be the return and let

$E[R(h)] \approx \sum_{h \in \mathrm{H}} R(h) \Pr(h)$

be the total expected return obtained from the training histories $h$.
The gradient of the return
Let the terminating time of training history $h$ be $T(h)$.
By differentiating $E[R(h)]$ with respect to $w$ and substituting the result for $\partial \Pr(h) / \partial w$,
the return gradient $\partial E[R(h)] / \partial w$ can be obtained as follows.
$\begin{aligned}
\frac{\partial}{\partial w} E[R(h)] &= \frac{\partial}{\partial w} \sum_{h \in \mathrm{H}} R(h) \Pr(h) \\
&= \sum_{h \in \mathrm{H}} \left[ \frac{\partial R(h)}{\partial w} \Pr(h) + R(h) \frac{\partial \Pr(h)}{\partial w} \right] \\
&= \sum_{h \in \mathrm{H}} \left[ \frac{\partial R(h)}{\partial w} \Pr(h) + R(h) \Pr(h) \sum_{k=0}^{T(h)} \frac{\partial}{\partial w} \ln \Pr(a_k \mid h) \right] \\
&= \sum_{h \in \mathrm{H}} \left[ R(h) \Pr(h) \sum_{k=0}^{T(h)} \frac{\partial}{\partial w} \ln \Pr(a_k \mid h) \right]
\end{aligned}$
The term ( )hRw∂∂ vanishes because the return is part of the environment. The policy
parameter is adjusted to change the probability of some training history happening but
does not alter the return obtained over that training history.
For example, the following figure shows a training history set of four histories,
$\{h_1, h_2, h_3, h_4\}$, with their respective observed returns. In Direct Gradient
methods, the policy parameter $w$ is adjusted to increase the probability of $h_3$
happening and to reduce that of $h_4$. However, this change in policy does not affect
the return observed over the respective training histories.
Figure 5: Example of return realization in training history
Under the Markov assumption, $\Pr(a_t \mid h_t) = \Pr(a_t \mid s_t) = f_t(w)$ represents the policy.
The gradient of the return can be re-expressed as follows.

$\frac{\partial}{\partial w} E[R(h)] = \sum_{h \in \mathrm{H}} \left[ R(h) \Pr(h) \sum_{k=0}^{T(h)} \frac{\partial}{\partial w} \ln f_k(w) \right]$

The return gradient vector

$\nabla E[R(h), w] = \left[ \frac{\partial}{\partial w_1} E[R(h)], \frac{\partial}{\partial w_2} E[R(h)], \ldots \right]$

can now be obtained.
Perform gradient ascent on the return and update the policy vector
The return gradient vector $\nabla E[R(h), w]$ is the gradient of the return with respect to the
weight vector of the current policy. The policy is represented by a point $w$ in the policy
space. It is possible to move the value of $w$ a small distance $\delta$ in the direction

$d = \frac{\nabla E[R(h), w]}{\left\| \nabla E[R(h), w] \right\|}$

which is the direction in which the return $E[R(h)]$ increases most
rapidly. The general policy update equation for direct gradient methods is given by
$w \leftarrow w + \delta \cdot d$.
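A minimal sketch of this update (illustrative only: the two-armed bandit, its payoffs and the softmax action selection function are assumptions, not a task from the thesis):

```python
import math, random

# Hypothetical two-armed bandit: arm 0 pays 1.0, arm 1 pays 0.2.
# Softmax policy Pr(a) = exp(w[a]) / sum_b exp(w[b]);
# the gradient of ln Pr(a) w.r.t. w[i] is (1 if i == a else 0) - Pr(i).
random.seed(1)
w = [0.0, 0.0]
ALPHA, REWARDS = 0.1, [1.0, 0.2]

def policy(w):
    z = [math.exp(v) for v in w]
    return [v / sum(z) for v in z]

for _ in range(2000):
    pi = policy(w)
    a = 0 if random.random() < pi[0] else 1      # sample an action
    R = REWARDS[a]                               # observe the return
    for i in range(len(w)):                      # ascend R * grad ln Pr(a)
        w[i] += ALPHA * R * ((1.0 if i == a else 0.0) - pi[i])

print(policy(w)[0])  # close to 1: the better arm dominates
```

The weight vector is nudged along the gradient of the log-likelihood of each sampled action, scaled by the observed return, so histories with high return become more probable.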
1.4.5.2.2 Policy comparison methods
The policy comparison approach is characterized by using comparative information on
policies to find the optimal policy. Given that a policy is represented by a
parameterized function with a weight vector $w$, the objective is to minimize the
difference in expected return $D(w^*, w) = E[R(w^*)] - E[R(w)]$ between the optimal policy and
the policy under comparison (Strens & Moore, 2002, Ng & Gordon, 2000).
When comparing two policies with parameter vectors $w_1$ and $w_2$ respectively, a fixed
set of initial states $\{s_{0,1}, s_{0,2}, \ldots, s_{0,n}\}$ is used to reduce the stochasticity of the
environment, so that the variability in the return difference $D_{s_0}(w_1, w_2)$ over a particular
initial state is predominantly due to the difference in policy parameters.
Given a large number $n$ of initial states, paired statistical tests can be applied to model the
difference between two policies evaluated from the same starting states.
Possible choices of statistical test include the two-sample (paired) t-test and the Wilcoxon
test.
The paired t-test assumes the $n$ return differences follow a normal
distribution and returns a probability indicating the likelihood of a non-zero
mean for the return differences. The Wilcoxon test uses the ranks of the return
differences to return a probability that indicates the likelihood of a non-zero median.
Once a better policy is determined, the less effective policy is replaced by a new
policy. The parameter $w_{new}$ of the new policy can be generated by some optimization
procedure. Such procedures include the Downhill Simplex Method, which generates
$w_{new}$ in the direction away from the previously less effective policy parameter, and
Differential Evolution, which uses genetic algorithms to improve $w_{new}$ directly.
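The paired comparison can be sketched as follows (illustrative only: the per-start returns are simulated, and the paired t statistic is computed by hand rather than with a statistics library):

```python
import math
import random

# Sketch of a paired policy comparison, assuming two policies are each
# evaluated once from the same fixed set of n starting states.
random.seed(2)
n = 30
returns_1 = [random.gauss(10.0, 1.0) for _ in range(n)]            # policy w1
returns_2 = [r + 0.5 + random.gauss(0.0, 0.3) for r in returns_1]  # policy w2

# Paired differences over the same n starting states.
d = [b - a for a, b in zip(returns_1, returns_2)]
mean_d = sum(d) / n
var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
t_stat = mean_d / math.sqrt(var_d / n)

print(t_stat)  # a large t suggests a genuine difference between the policies
```

Pairing on the same starting states cancels much of the environment's stochasticity, which is exactly why the fixed initial-state set is used.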
1.5 Ancillary issues
The following two sub-sections describe two common RL issues namely the
exploration-exploitation dilemma and the temporal credit assignment problem. The
last sub-section explains a shaping technique, which is used to overcome complex RL
problems.
1.5.1 Exploration–exploitation dilemma
An agent is exploiting when it selects an action to obtain a reward that is already
known, and exploring when it selects another action to gain new information. A
system may lose performance by exploring, while it may never improve its
performance through exploiting alone (Holland, 1992). It is important to
balance the exploration and exploitation ratio so as to maximize the knowledge gained
during learning and to minimize the costs of exploration and learning time. A common
practice used to overcome the exploration-exploitation dilemma is to implement an $\varepsilon$-greedy
policy, in which the agent exploits with probability $(1 - \varepsilon)$ and explores
with probability $\varepsilon$.
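A minimal sketch of $\varepsilon$-greedy action selection over a row of lookup-table Q-values (an illustrative helper; the names and values are assumptions, not thesis code):

```python
import random

# Epsilon-greedy selection: explore with probability epsilon,
# otherwise pick the action with the highest Q-value.
def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

random.seed(3)
q = [0.1, 0.5, 0.2]
picks = [epsilon_greedy(q, 0.1) for _ in range(1000)]
print(picks.count(1))  # the greedy action dominates when epsilon is small
```

With $\varepsilon = 0.1$ the greedy action is chosen about 93% of the time (0.9 plus a 1-in-3 share of the random picks).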
1.5.2 Temporal credit assignment
The rewards provided to the agent can be noisy and delayed from the actions that
caused them. The designer has to deal with the problem of how to reinforce actions
that have long-reaching effects. The return to be maximized can be formulated in either
of the following two ways.

$R(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s \right]$
Equation 12: A discounted sum of rewards
$R(s) = \lim_{T \to \infty} E\left[ \frac{1}{T} \sum_{t=1}^{T} r_t \;\middle|\; s_0 = s \right]$
Equation 13: A long term average reward
The discounted version of the return is more robust because it allows the
temporal credit assignment problem to be solved in infinite horizon settings (Sutton & Barto,
1988).
If the state transition process is ergodic, every state is eventually visited
arbitrarily often. In that case the long term average version can be used, since it gives equal weighting
to all rewards received throughout the evolution of the process.
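The two return formulations can be computed for a finite reward sequence as a quick sketch (an illustrative finite-horizon truncation of the sums above):

```python
# Discounted sum of rewards (Equation 12, truncated to a finite sequence)
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Long term average reward (Equation 13, over a finite horizon)
def average_reward(rewards):
    return sum(rewards) / len(rewards)

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(average_reward(rewards))          # 1.0
```

The discount factor $\gamma < 1$ keeps the infinite sum finite, while the average formulation weights early and late rewards equally.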
1.5.3 Shaping
The size of the state space correlates with the complexity of the given task. Searching
through a large policy space for optimal actions can be time consuming. Shaping is
used to ease the agent’s learning in complex problems. This is achieved by giving the
agent a series of relatively easy problems building up to the harder problem of
interest (Selfridge, Sutton & Barto, 1985).
Consider the example of training a mechanical arm to pick up an object. The machine
has control over the various joints of the arm. Given that the object is within reachable
distance of the arm, the space of joint movements that bring the claw above the object and
then lower it to grasp the object is combinatorially large.
Shaping can be applied to ease learning by decomposing the task into simpler sub-
tasks. The arm can firstly learn to bring the claw in line with the object. Then, it learns
to bring the claw over the object. And finally, the arm can bring the claw down to
hold onto the object.
1.6 RL Applications
RL methods are practical for real applications. In a typical control situation, the
controllers of automated processes must adapt to a dynamically changing environment
where the optimal heuristic is not known. A substantial number of applications have
proven that the automated nature of RL is fruitful.
1.6.1 Cellular communication system
An RL method has been applied to a large cellular telephone system with approximately
$49^{70}$ states (Singh, Jaakkola & Jordon, 1995). The task of bandwidth allocation is
formulated as a dynamic programming problem. The goal is to find a dynamic channel
allocation policy that maximizes service by minimizing the number of blocked
and dropped calls.
In this optimal channel control problem, the large state space makes traditional DP
techniques impractical. This large scale optimal control problem is solved by the DP-induced
RL method. Better solutions are found than with any previously available
heuristics. This demonstrates the superiority of RL paradigms in complex and large
optimal control applications.
1.6.2 Others
Other concrete and successful RL applications are elevator control (Crites & Barto,
1996), job-shop scheduling (Zhang & Dietterich, 1995), game learning such as chess
and backgammon (Sutton & Barto, 1998), power network distribution (Schneider, 1999)
and Internet partner scheduling (Abe & Nakamura, 1999).
1.7 Research challenges in RL
Current research in RL has addressed the scalability of algorithms in large state
spaces, the partial observability of the environment and the limitation of reactive
behaviour upon tasks which require memory.
1.7.1 Scaling up to large problems
1.7.1.1 Complex task
Hierarchical decomposition of a task represents a strategy for dealing with very large
state spaces (Hernandez-Gardiol & Mahadevan, 2000, Dietterich, 2000). The
motivation is to provide faster learning, at the cost of slight sub-optimality in
performance, through the decomposition of a task into a collection of simpler
subtasks. This is especially relevant in situations where the training time is limited.
1.7.1.2 Large and continuous state spaces
Compact value function representation is necessary for value function approximation
in problems with large and continuous state spaces (Cassandra, 1998, Hauskrecht,
2000, Uther & Veloso, 1996). The common practice in function representation is
to use a neural network, which has the disadvantage of being slow to train. Newer
approaches to value function approximation include methods based on fitted
regression trees (Wang & Dietterich, 2002) and support vector
machines (Dietterich & Wang, 2002).
1.7.1.3 Large and continuous action spaces
To learn and act in a large continuous action space, one possible solution is to smooth
the probability transition functions of similar actions during learning (Aberdeen,
Baxter, 2002, Meuleau, Kim, Kaelbling, Cassandra, 1999).
1.7.1.4 Intelligent exploration
In most RL methods, the agent is uninformed, or only minimally informed, about how
to explore the environment. In practice, however, the RL system
designer can provide guidance to the agent in the form of an initial policy (Shelton,
2001, Shapiro, Langley & Shachter, 2000). This information is typically sub-optimal.
The research question is how to exploit the initial policy during learning without
preventing the future development of the optimal policy.
1.7.2 Partial observability of the environment
RL is a common approach to the POMDP training problem (Murphy, 2000,
Schmidhuber, 1991). POMDPs are a generalization of MDPs in which a finite set of
states governs the environment’s dynamics but the agent does not have direct access to
those states. The model originated in the operations research (OR) literature for
describing planning tasks in which complete information about the current situation is
inaccessible. The agent can only infer the state from a set of hopefully state-related
observations (Aberdeen, 2002, Cassandra, 1999, Kaelbling, Littman & Moore,
1996, Williams & Singh, 1998).
The POMDP framework models the partially observable dynamics of the environment
(Figure 6). At each time step $t$, the environment is assumed to be in a state $x_t$. The agent
performs an action $a_t$ to cause a state transition and the environment changes into a
new state $x_{t+1}$. However, the state transition is hidden and the agent receives an
observation $y_t$ as some stochastic function of the state $x_t$. If the observation
exactly represents the state, $y_t = x_t$, the POMDP reduces to the ideal case of a fully observable
MDP. In addition, the agent receives a scalar reward, and the goal of the agent is to
learn a policy $\pi$ that maximizes the return.
Figure 6: State transition dynamic under POMDP
1.7.3 The need of memory
Under partial observability, the current sensory information an agent receives may not
identify the underlying state. This is because the number of distinct observations is
typically far smaller than the number of possible states, so the observations are not
distinctive enough for policy mapping. The true states of the environment are then said
to be hidden.
1.8 Automatic state construction
Automatic state construction is a process in which an agent constructs its own
(internal) state representation. It is a common solution in supervised learning to
problems where the given input does not directly match the required state representation
(Bauer & Pawelzik, 1992, Bauer & Villmann, 1995).
A self-organising neural network with state construction capability (e.g. Growing
Neural Gas) can add refined nodes into its state space to deal with dynamic input
distributions so as to approximate the input space more accurately.
For the reasons given in section 1.7, RL agents need to construct their own state space.
The focus then falls on how these internal state representations can be dynamically
constructed from the observation history, so that useful past experience can be
generalized and applied in new and different situations (Aberdeen & Baxter, 2002,
Dutch, 1998, Lovejoy, 1991, McCallum, 1995).
2 U-Tree: A RL algorithm with automatic state construction functionality
Chapter summary
Chapter 2 presents the U-Tree method (McCallum, 1995), which is a RL algorithm
with automatic state construction functionality. The U-Tree method allows an agent to
construct and refine a tree structured representation of its internal state and policy by
extracting relevant features from the RL system. These features are composed of
present and past observations of the environment or events. Hence, the policy derived
is reactive to both the current and past observations. This incorporates a form of short-term
memory into the agent’s behaviour.
In section 2.1, the U-Tree algorithm is explained in terms of how decision tree
technique can be applied in internal state construction. Details of the Kolmogorov-
Smirnov Two Sample test, which is the selection criterion used for tree expansion and
internal state refinement are discussed.
Section 2.2 provides the pseudo code for the U-Tree algorithm in terms of how
learning can be facilitated by a U-Tree and how a U-Tree grows with experience.
Section 2.3 discusses the limitations and shortcomings of the U-Tree algorithm.
2.1 Introduction to the U-Tree algorithm
U-Tree (McCallum, 1995) is a RL algorithm designed to overcome the
aforementioned practical issues in RL research. This method focuses on situations
where purely reactive policy performs sub-optimally under partial observability.
The U-Tree algorithm allows the agent to extract relevant information to create its
own internal state representation. This internal state space serves as an abstraction of
the environment, which classifies the indistinctive states observed for learning to act
upon (Aberdeen & Baxter, 2002, Bakker, 2001, Dutch, 1998, Meuleau, Peshkin, Kim
& Kaelbling, 1999).
Leaves of a U-Tree partition the internal state space into the set of currently most
refined internal states. During leaf expansion, the internal state represented is further
refined.
42
The U-Tree is essentially the policy of the agent: it is a classification tree over the
current state of the agent, with value function estimates stored at the leaves for action
output.
2.1.1 Problem domain targeted by the U-Tree
The U-Tree algorithm was designed for partially observable environments with large
state space dimension. It is capable of dealing with both discrete and continuous state
spaces.
A large state space dimension results in an abundance of observations. Many of these
observations are not task relevant and are not required for the internal state
construction. Feature extraction prunes away these irrelevances.
Hidden states occur when the current observation alone is insufficient to determine
the state of the environment (McCallum, 1995, Cassandra, 1999). Memory from
previous observations is needed to augment the current perceptual input to reveal the
hidden states.
For example, a driver agent can be overloaded with information about its
surroundings on the road. With respect to the task of driving safely, the agent needs to
be aware of the approaching and upcoming traffic and the traffic signals. Information
that describes an approaching vehicle, such as its colour, is irrelevant and should be
pruned away.
Since it is impossible to attend to both the frontal and rear views simultaneously, a
hidden state problem arises when the agent considers changing lanes. Using only the
current frontal percept, it cannot distinguish between the states that correspond to the
presence and the absence of an approaching vehicle in the lane it wishes to change to.
These two undistinguished states are said to be hidden. In order to change lanes safely,
the agent needs to augment the current percept with the rear view information from
previous perceptions to reveal the hidden state.
2.1.2 The architecture of a U-Tree
Before the details of a U-Tree are presented, the concept of decision trees is
discussed.
2.1.2.1 Decision trees
A decision tree is a classifier. It assigns an input to a class amongst a finite number of
classes. The input of a decision tree can be an object or a situation that is described by
a set of attributes. Each internal node corresponds to an attribute. Edges originating
from a node are labelled with the possible values of the associated attribute. The input
is classified by performing a sequence of node-edge matches from the root to a leaf,
which represents a class label or a decision.
The following diagram illustrates a possible decision tree, which can be used to
determine when to play tennis. The input of the tree is the sky outlook. The edges
originating from the sky outlook node are labelled with the possible values and lead
to class labels, each representing a decision upon which an action is made.
For example, it is decided not to play tennis on rainy days.
Figure 7: A decision tree for playing tennis according to the sky outlook
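The root-to-leaf matching can be sketched with a small nested-dictionary tree. This is an illustrative sketch only: the rainy-day branch follows the text, while the sunny and overcast branches are assumptions added to complete the example.

```python
def classify(tree, example):
    """Perform the root-to-leaf node-edge matches described above."""
    while isinstance(tree, dict):                 # internal node: test an attribute
        tree = tree["branches"][example[tree["attribute"]]]
    return tree                                   # leaf: the class label / decision

# The tennis tree of Figure 7: a single test on the sky outlook.
tennis_tree = {"attribute": "outlook",
               "branches": {"sunny": "play",
                            "overcast": "play",
                            "rain": "do not play"}}
```

For instance, `classify(tennis_tree, {"outlook": "rain"})` follows the rain edge to the leaf that decides not to play.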
2.1.2.2 A U-Tree
A history table records the observations made at each time step. A U-Tree classifies
the history table (the input) into an internal state (the class label).
Table 2 below is a typical history table for a soccer agent before the update at time t.

Time            1           2           …    t−1          t
Ball position   Left        Left        …    Unknown      Unknown
Action          Turn left   Turn left   …    Turn right   ?
Reward          3           8           …    −5.5         ?

Table 2: An example of a history table before update
In a U-Tree, each internal node corresponds to a test on a feature, which consists of
an observation f and its history index lag. The history index allows a form of short
term memory by specifying how far back in the history the observation at the node is
to be tested.
Figure 8: A U-Tree for a soccer agent to get to a ball
In Figure 8, the classification process begins at the root node, where the observed ball
position at time t is examined. The value of the current ball position is found to be
'Unknown', and this links the root node to the internal node that examines the
observed ball position at time t−1. The value of the ball position in the immediate past
is also found to be 'Unknown'. A leaf (an internal state) is reached, with the action
a_t = 'Panoramic vision' being carried out. A reward r_t = 6 is received and the
history table is updated.

Time            1           2           …    t−1          t
Ball position   Left        Left        …    Unknown      Unknown
Action          Turn left   Turn left   …    Turn right   Panoramic vision
Reward          3           8           …    −5.5         6

Table 3: An example of a history table after update
A U-Tree also represents the policy of a RL agent because the action value vector and
the action selection probability vector are stored at each leaf.
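The classification walk through a U-Tree can be sketched as follows. Each internal node tests one (observation, lag) feature against the history table; the tree fragment loosely mirrors the soccer example, though its exact branches are illustrative.

```python
def utree_classify(node, history, t):
    """Descend from the root to a leaf; each internal node tests the
    value of one observation at time t - lag in the history table."""
    while "feature" in node:
        obs, lag = node["feature"]
        node = node["branches"][history[t - lag][obs]]
    return node                      # a leaf: an internal state

# A fragment echoing Figure 8: test the current ball position; if it is
# unknown, test the ball position one step back in the history.
soccer_tree = {
    "feature": ("ball position", 0),
    "branches": {
        "Left": {"action": "Turn left"},
        "Unknown": {
            "feature": ("ball position", 1),
            "branches": {
                "Left": {"action": "Turn left"},
                "Unknown": {"action": "Panoramic vision"},
            },
        },
    },
}

history = [{"ball position": "Left"},
           {"ball position": "Unknown"},
           {"ball position": "Unknown"}]
leaf = utree_classify(soccer_tree, history, t=2)
```

Here the walk matches the worked example: both the current and the previous ball positions are 'Unknown', so the reached leaf holds the panoramic-vision action.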
2.1.3 Feature extraction
Leaf expansion is periodically carried out to discover relevant features to extend the
tree. A pre-selected pool of candidate features cf is available at each leaf. When an
appropriate winning feature cf_win is found at a leaf, that leaf is extended into a
sub-tree (Figure 9).
The internal state represented by the extended leaf is partitioned into a set of more
refined internal states. The associated time-indexed observations, including the
returns, are classified into the new leaves according to the values of the candidate
feature cf_win.
Figure 9: Partition of a state including returns during leaf expansion
2.1.3.1 The selection criterion
The distributional differences found between the return set and its subsets at a leaf
before and after the introduction of a candidate feature cf indicates the disparity of
cf in refining the internal state with respect to return prediction.
Consider a mobile RL agent, which learns obstacle avoidance behaviour with
obstacles that are identical in size but differ in colour. The candidate features are the
‘colour’ and the ‘size’ of the closest obstacle in view.
The size of the obstacle is task relevant but the colour is not since obstacles are to be
avoided regardless of their colour. Therefore, the return distribution difference, by the
feature ‘size’, should be more significant than that by the feature ‘colour’.
The Kolmogorov-Smirnov (K-S) Two Sample test is used to test the significance of
the difference in return distributions.
2.1.3.2 The Kolmogorov-Smirnov Two Sample test
The Kolmogorov-Smirnov (K-S) test is a statistical test that investigates the
difference between two distributions. The test compares two distributions and outputs
the likelihood of distributional difference.
The test is non-parametric. The null hypothesis assumes the equality of the
distributions under comparison. The probability of the null hypothesis is computed in
terms of the maximum distributional difference.
Procedure

Let X_1 and X_2 be two distributions, with samples M_1 and M_2 taken.

1. Establish the null hypothesis H_0 that the distributions of X_1 and X_2 are equal
2. Construct the class column of a cumulative frequency table
   a. Find x_max = sup (M_1 ∪ M_2)
   b. Find x_min = inf (M_1 ∪ M_2)
   c. Partition [x_min, x_max] into m intervals, each with equal interval length
      c = (x_max − x_min) / m
   d. Label the class column x with {x_min + c, x_min + 2c, …, x_max}
3. List the cumulative frequency columns F_X1(X_1 < x) and F_X2(X_2 < x)

   x            F_X1(X_1 < x)    F_X2(X_2 < x)
   x_max        n_X1,m           n_X2,m
   ⋮            ⋮                ⋮
   x_min + c    n_X1,1           n_X2,1

4. Append the relative cumulative frequency columns
   E_X1(X_1 < x) = F_X1(X_1 < x) / |M_1| and E_X2(X_2 < x) = F_X2(X_2 < x) / |M_2|
5. Append the absolute difference column D = |E_X1 − E_X2|
6. Identify the maximum difference d_max = sup (D)
7. Compute the test statistic TS_D = d_max · M_1 · M_2
8. Check the chi-square distribution χ²(M_1, M_2) and determine an acceptance level
   (1 − θ_D) such that if Pr(χ² ≤ TS_D) > θ_D, reject the null hypothesis H_0 and
   conclude that M_1 and M_2 come from different distributions
Figure 10: Probability density function of the χ² distribution
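The core quantity of the test, the largest gap between the two empirical cumulative distributions, can be sketched in a few lines. This simplified version compares the raw samples directly instead of binning them into m classes as in the procedure above.

```python
def ks_two_sample_dmax(sample1, sample2):
    """Largest absolute gap between the empirical CDFs of two samples:
    the d_max of steps 5-6 above."""
    n1, n2 = len(sample1), len(sample2)
    d_max = 0.0
    for x in sorted(set(sample1) | set(sample2)):
        e1 = sum(1 for v in sample1 if v <= x) / n1   # empirical CDF of sample 1
        e2 = sum(1 for v in sample2 if v <= x) / n2   # empirical CDF of sample 2
        d_max = max(d_max, abs(e1 - e2))
    return d_max
```

In the U-Tree setting, two identical return sets give d_max = 0 (no evidence for splitting a leaf), while completely disjoint return sets give d_max = 1.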
2.2 Construction of a U-Tree
In this section, the U-Tree algorithm is described in pseudo code. It starts with an
overview of the algorithm, followed by descriptions of the exploiting and the
exploring phases.
2.2.1 The U-Tree learning algorithm
Initialization
  Policy iteration frequency parameter freq_improve
  Internal state refinement frequency parameter freq_refine
  The maximum number of episodes allowed Ep_max
  The maximum number of time steps allowed t_max in each episode
  The history table H
  Initialize the U-Tree with the root node as the only state s
  For every a ∈ A(s)
    Initialize Q(s, a) to an arbitrary constant
  Initialize policy π to random action selection

Ep_n = 1                       // Ep_n is the episode number
Repeat
  t = 1
  Repeat
    a_t ← Choose-Action (U-Tree, H)
    t ← t + 1
  Until t = t_max or the episode is completed
  Update the return R in H
  If Ep_n mod freq_improve = 0
    π ← Policy-Iteration (U-Tree, H)
  If Ep_n mod freq_refine = 0
    U-Tree ← Internal-State-Refinement (U-Tree, H)
  Ep_n ← Ep_n + 1
Until Ep_n = Ep_max
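The learning loop can be sketched as a plain Python skeleton, with the three U-Tree routines passed in as callables. This shows the control flow only, not the routines themselves, which are described in the following subsections.

```python
def utree_train(choose_action, policy_iteration, refine,
                ep_max=200, t_max=50, freq_improve=2, freq_refine=5):
    """Skeleton of the U-Tree learning loop: act for t_max steps per
    episode, then periodically improve the policy and refine the tree."""
    history = []                                     # the history table H
    for ep in range(1, ep_max + 1):
        for _ in range(t_max):
            history.append(choose_action(history))   # a_t <- Choose-Action
        if ep % freq_improve == 0:
            policy_iteration(history)                # update Q and pi
        if ep % freq_refine == 0:
            refine(history)                          # grow the tree
    return history
```

With freq_improve = 2 and freq_refine = 5, policy iteration runs every second episode and internal state refinement every fifth, matching the periodic improvement scheme above.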
2.2.2 Using the tree
The U-Tree classifies the history table H to output a_t at each time step.

Function Choose-Action (U-Tree, H) returns a_t
  Append observation o_t to H
  a_t ← U-Tree-Classification (U-Tree, H)
  Append a_t, r_t to H
2.2.3 Improving the tree
The improvement phase consists of the policy improvement process and the internal
state refinement process. Both processes are carried out periodically in accordance
with the parameters freq_improve and freq_refine.
2.2.3.1 The Policy Iteration process
The policy iteration process requires the input of H to update the action values
Q(s, a) for each leaf s.

Function Policy-Iteration (U-Tree, H) returns U-Tree
  For each leaf s
    For each a ∈ A(s)
      I(s, a) = {t | s_t = s, a_t = a}
      R(s, a) = Σ_{t ∈ I(s, a)} R_t / |I(s, a)|
      For every s′ ∈ S
        I(s, a, s′) = {t | s_t = s, a_t = a, s_{t+1} = s′}
        Pr(s′ | s, a) = |I(s, a, s′)| / |I(s, a)|
      Q(s, a) = R(s, a) + γ Σ_{s′} Pr(s′ | s, a) · max_{a′} Q(s′, a′)
  π = Action-Selection-Probability-Update (Q)
  Return U-Tree
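As an illustrative sketch (the helper below is hypothetical, not from the thesis), a single sweep of this update can be written over a list of (s, a, R, s′) experience tuples harvested from the history table. One sweep starts from all-zero estimates; repeated sweeps would be needed for the Q-values to converge.

```python
from collections import defaultdict

def policy_iteration_sweep(transitions, gamma=0.7):
    """One sweep of the leaf-level update: empirical R(s, a) and
    Pr(s' | s, a) are estimated from experience, then Q is updated."""
    visits = defaultdict(list)              # (s, a) -> list of (R, s')
    for s, a, R, s_next in transitions:
        visits[(s, a)].append((R, s_next))
    Q = defaultdict(float)                  # starts from all-zero estimates
    for (s, a), outcomes in visits.items():
        n = len(outcomes)
        q = sum(R for R, _ in outcomes) / n               # R(s, a)
        for s_next in {sn for _, sn in outcomes}:
            p = sum(1 for _, sn in outcomes if sn == s_next) / n
            best = max((Q[(s_next, a2)] for (s2, a2) in list(visits)
                        if s2 == s_next), default=0.0)    # max_a' Q(s', a')
            q += gamma * p * best
        Q[(s, a)] = q
    return Q
```

With two visits to the same leaf-action pair yielding rewards 1 and 3, and a successor leaf with no estimated value yet, the sweep produces the empirical mean reward of 2.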
2.2.3.2 The Internal State Refinement process
An internal state s can be refined by selecting the most suitable candidate feature
cf ∈ CF(s) for leaf expansion. The Kolmogorov-Smirnov Two Sample test is used
to assess the return distributional difference when a candidate feature cf is
introduced. If the probability of the distributional difference exceeds a predetermined
threshold, the candidate feature currently considered extends that leaf.
Function Internal-State-Refinement (U-Tree, H) returns U-Tree
  For each leaf s
    For each cf ∈ CF(s)
      Add a sub-tree at s by cf with potential leaves l
      For each l
        For each a
          I(l, a) ← {t | l_t = l, a_t = a}
          d(s, l, a) ← KS-2-Sample-test ({R_i : i ∈ I(s, a)}, {R_j : j ∈ I(l, a)})
      d(s, cf) ← Σ_l Σ_a d(s, l, a)
    cf_win ← argmax_cf d(s, cf)
    If d(s, cf_win) > predetermined threshold θ_D
      For each l
        Q(l, a) ← Q(s, a)
        l ← s                       // each new leaf l replaces s as an internal state
      s ← s ∪ cf_win
      F(s) ← F(s) − cf_win
  Return U-Tree
2.3 Shortcomings
2.3.1 Circular dependency
The U-Tree algorithm shares the most common weakness of the value based RL
paradigms, in which the model of the environment obtained may bias the optimality of
the policy. During the construction of the U-Tree, the building of the tree depends on
the current Q-value estimates. In turn, the Q-value estimates depend on the current
policy, and the policy depends on the current structure of the tree. This circular
dependency can hinder the convergence of the action value
estimates.
2.3.2 Pre-selection of the candidate feature
The set of pre-selected candidate features must contain all the observations minimally
required to represent the partially observed environment. The designer must have
some background knowledge of the situation to provide a set of candidate features.
3 Proposed method and results
Chapter summary
This chapter presents a variation of the U-Tree algorithm (McCallum, 1996). A new
feature extraction criterion, the Information Gain Ratio (IGR) test, is introduced to the
U-Tree framework for internal state construction.
Section 3.1 introduces the variant of the U-Tree, beginning with a description of the
original U-Tree. This is followed by a section that discusses the technical background
of Information Theory and how it can be applied to give a new RL algorithm.
Section 3.2 provides the pseudo code for the IGR test U-Tree algorithm in terms of
the internal state refinement process.
A set of three experiments is conducted to assess the IGR test with respect to the K-S
test in growing a U-Tree. In section 3.3, the ANOVA test, which is used to compare
performance between the two U-Tree algorithms, is discussed. Then, the three
problem domains are described. These problem domains include robot navigation,
driving, and elevator control. Each problem domain subsection provides the
description of the environment, the task, the action set, the reward, the candidate
feature set, the training conditions, results and comparisons.
3.1 Introduction to the variant of the U-Tree method
A U-Tree is a tree-structured representation of both the internal states and the policy
for an RL agent. The internal states are refined when relevant observations are
extracted to grow the tree.
Our new variant of the U-Tree serves the same purpose. It differs from the original U-
Tree method in the feature extraction criterion, where the IGR test replaces the K-S
test.
3.1.1 Information Gain Ratio test
The IGR test is a feature extraction criterion for decision tree learning (Mitchell, 1997),
corresponding to the C4.5 algorithm (Quinlan, 1993). IGR provides a disparity
measure when a sample is classified by a feature. It measures the expected reduction
in information caused by partitioning the sample according to the feature, normalised
by the homogeneity of that feature.
3.1.2 Procedure of the IGR test
Given a set of returns R, discretized with possible values {u_1, u_2, …, u_m}, and a
feature f with values {v_1, v_2, …, v_n}:

1. Estimate the probability Pr(R = u_j) for j = 1, …, m from the history
2. Calculate the Information (also known as the entropy) I(R), which measures
   the homogeneity of the returns

   I(R) = − Σ_{j=1}^{m} Pr(R = u_j) log₂ Pr(R = u_j)

3. Calculate the Information Gain, which measures the expected reduction in
   information caused by partitioning the return R according to the feature f

   IG(R | f) = I(R) − Σ_{k=1}^{n} Pr(f = v_k) I(R | f = v_k)

4. Calculate the Information of the feature f

   I(f) = − Σ_{j=1}^{n} Pr(f = v_j) log₂ Pr(f = v_j)

5. Calculate the Information Gain Ratio

   IGR(R | f) = IG(R | f) / I(f)
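Steps 1 to 5 can be implemented compactly over paired lists of discretized returns and feature values. This is a minimal sketch, not the thesis code.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain_ratio(returns, feature_values):
    """IGR(R | f) = IG(R | f) / I(f) over paired lists of discretized
    returns and feature values, following steps 1-5 above."""
    n = len(returns)
    gain = entropy(returns)                              # I(R)
    for v, count in Counter(feature_values).items():
        subset = [r for r, fv in zip(returns, feature_values) if fv == v]
        gain -= (count / n) * entropy(subset)            # subtract Pr(f=v) I(R|f=v)
    split_info = entropy(feature_values)                 # I(f)
    return gain / split_info if split_info > 0 else 0.0
```

A feature that perfectly predicts the returns scores 1, while a feature that carries no information about them scores 0.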
3.2 Construction of a U-Tree by IGR test
In the context of a U-Tree, the IGR test aims to select the most appropriate candidate
feature cf to classify the return set {R_j : j ∈ J(s)} at leaf s. The return values are
discretized to allow the calculation of the information on the return.
Function Internal-State-Refinement (U-Tree, H) returns U-Tree
  For each leaf s of the U-Tree
    Retrieve the time index set J(s)
    Retrieve the return set {R_j : j ∈ J(s)} from the history table H
    R(s) ← discretized {R_j : j ∈ J(s)}
    For each candidate feature cf ∈ CF(s) available at leaf s
      For each action a
        Find R(s, a) = {R_t | a = a_t, t ∈ J(s)}
        Calculate the information I[R(s, a)]
        Calculate the information gain IG[R(s, a) | cf]
        Calculate the information I[cf]
        Calculate the information gain ratio IGR[R(s, a) | cf]
      IGR[R(s) | cf] ← Σ_a IGR[R(s, a) | cf]
    cf_win ← argmax_cf IGR[R(s) | cf]
    If IGR[R(s) | cf_win] > predetermined threshold θ
      For each value v of cf_win
        Update the Q-value estimate vector Q(v, a) ← Q(s, a)
        Update the action selection probability vector P(a | v) ← P(a | s)
        Update the candidate features CF(v) ← CF(s) − cf_win
        Update the new leaf s ← v
  Return U-Tree
3.3 Experimental results
We now compare the performance of the K-S test U-Tree and the IGR test U-Tree
in three RL problems: a robot navigation problem, a New York driving problem and
an elevator control problem.
The robot soccer navigation problem involves an agent that learns to position itself to
shoot a goal. The New York driving problem requires an agent to avoid traffic by
changing lanes safely (McCallum, 1995). The elevator control problem aims to
maintain passenger flow in a building with three elevators (Singh, Jaakkola & Jordan,
1995).
To allow automatic internal state construction, a set of candidate features is pre-
selected for feature extraction in each problem. This set of candidate features is a
product set formed from an observation set and a history index set.
The observation set consists of the information the agent observes in the system.
The history index set consists of time lags, which indicate the number of time steps
elapsed before the current one. As a product set of the observation set and the lag set,
the candidate feature set is composed of both current and past observations. For
example, the candidate feature 'observation X at lag 0' denotes the current value of
observation X; the candidate feature 'observation Y at lag 2' indicates the value of Y
observed two time steps ago.
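The product-set construction can be written down directly; the observation names below are a small illustrative subset, not a full problem specification.

```python
from itertools import product

observations = ["ball centre", "ball width", "goal centre"]   # illustrative subset
lags = [0, 1, 2, 3, 4]                                        # the history index set

# Each candidate feature pairs an observation with a lag, e.g.
# ("ball centre", 2) is the ball centre observed two time steps ago.
candidate_features = list(product(observations, lags))
```

With 3 observations and 5 lags, the product set contains 15 candidate features.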
3.3.1 Using the Analysis of Variance statistical test
Analysis of Variance (ANOVA) is a statistical test that can be applied to make
comparisons amongst groups of data. In the context of the U-Tree experiments,
ANOVA is applied to compare the performance variability.
3.3.1.1 One-way ANOVA
In a simple situation, given n groups of independent samples, each with m
observations, a one-way ANOVA test investigates the variability due to the
differences among the groups of samples by comparing the sample means.
Sample \ Observation    1        2        …    m
1                       x_1,1    x_1,2    …    x_1,m
2                       x_2,1    x_2,2    …    x_2,m
⋮                       ⋮        ⋮             ⋮
n                       x_n,1    x_n,2    …    x_n,m

Table 4: Data table for a one-way ANOVA test
15 trials are conducted for each of the U-Tree experiments; the one-way ANOVA
investigates the variability in performance of a U-Tree algorithm across the
trials.
Trial \ Time    1         2         …
1               p_1,1     p_1,2     …
2               p_2,1     p_2,2     …
⋮               ⋮         ⋮
15              p_15,1    p_15,2    …

Table 5: A U-Tree performance data table for a one-way ANOVA test
The null hypothesis of a one-way ANOVA test assumes that all the samples are drawn
from the same population. The test returns a probability to indicate the likelihood of
the null hypothesis. A small returned probability suggests that at least one sample
mean is significantly different from the other sample means.
3.3.1.2 Two-way ANOVA
ANOVA is capable of providing factor analysis to investigate the variability of data
caused by some factors. Given a factor with two levels, A and B, and data collected
over n trials, each with m observations at each factor level, a two-way ANOVA test
investigates the variability caused by the different factor levels.
Trial    Factor \ Observation    1        2        …    m
1        A                       x_1,1    x_1,2    …    x_1,m
         B                       x_1,1    x_1,2    …    x_1,m
2        A                       x_2,1    x_2,2    …    x_2,m
         B                       x_2,1    x_2,2    …    x_2,m
⋮        ⋮                       ⋮        ⋮             ⋮
n        A                       x_n,1    x_n,2    …    x_n,m
         B                       x_n,1    x_n,2    …    x_n,m

Table 6: Data table for a two-way ANOVA test
The two-way ANOVA is used to compare the performances of the two U-Tree
algorithms at the factor levels.
The null hypothesis of a two-way ANOVA test assumes that the different factor levels
have no effects on the samples. The test returns a probability indicating the likelihood
of the null hypothesis. A small probability returned suggests that any significant
variability in the data is contributed by the factor investigated.
3.3.2 A robot soccer problem
The Kiks simulator is used to develop the RL system in the robot soccer problem. The
system embeds an agent that learns to align itself to shoot a goal.
3.3.2.1 Description of a robot soccer problem
The environment consists of the following objects
o a field, which is 1200 mm by 700 mm,
o a single goal, which is 300 mm wide,
o a ball, which is 90 mm in diameter and
o the agent, which is 120 mm in diameter.
The position of the ball and that of the agent are randomly initialised.
Figure 11: A snap shot of the robot navigation problem
The set of actions
The action set of the soccer agent is composed of four actions. They include 3 motor
actions and 1 sensory action. These actions are listed in the following table. Action Description Move forward The agent moves forward by 10 mm Turn left The agent turns left by 9 degrees Turn right The agent turns right by 9 degrees Observe The agent perform a panoramic view to observe the position of the ball
and that of the goal
Table 7: The actions of the soccer agent
The reward
Two reinforcement functions are used, in accordance with the different stages of
learning under the technique of Shaping.
During stage 1 of learning (Figure 11), the agent learns to get within a desirable
distance from the ball. The following reinforcement function calculates the reward in
proportion to the absolute difference between the observed ball width from the
agent's current position and the desired ball width:

r = c_1 · |width_ball − width_desired| + c_2

where c_1, c_2 are scaling constants.
During stage 2 of learning (Figure 11), the agent learns to move into a shooting
position given the policy learnt in stage 1. The reinforcement function is expressed as
a linear combination of the observed ball width and the observed angle between the
centre of the ball and that of the goal:

r = c_3 · width_ball + c_4 · angle_ball-goal + c_5

where c_3, c_4, c_5 are scaling constants.
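The two reinforcement functions can be sketched directly from the formulas above. The scaling constants are placeholders, since their values and signs are not given in the text; in practice the designer would choose them so that progress towards the task yields higher reward.

```python
def stage1_reward(width_ball, width_desired, c1=1.0, c2=0.0):
    """Stage 1: reward proportional to the absolute gap between the
    observed and desired ball width (placeholder constants c1, c2)."""
    return c1 * abs(width_ball - width_desired) + c2

def stage2_reward(width_ball, angle_ball_goal, c3=1.0, c4=1.0, c5=0.0):
    """Stage 2: linear combination of the observed ball width and the
    ball-goal angle (placeholder constants c3, c4, c5)."""
    return c3 * width_ball + c4 * angle_ball_goal + c5
```

Under shaping, the first function drives the agent towards the ball and the second is swapped in once the tree stops growing, to refine the policy towards a shooting position.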
The set of candidate features
The set of candidate features is a product set of an observation set and a history set.
The observation set describes what the agent observes on the soccer field. The history
index set consists of lag ∈ {0, 1, 2, 3, 4}.
Observation            Description                                  Value
Ball centre            The apparent centre of the ball              0 if Unknown
                       relative to the agent                        1 if Left
                                                                    2 if Front
                                                                    3 if Right
                                                                    4 if Rear
Ball width             The apparent width of the ball               0 if w_ball = 0
                       w_ball relative to the agent                 1 if 1 mm ≤ w_ball < 30 mm
                                                                    2 if 31 mm ≤ w_ball < 60 mm
                                                                    3 if 61 mm ≤ w_ball < 90 mm
                                                                    4 if w_ball ≥ 90 mm
Goal centre            The apparent centre of the goal              0 if Unknown
                       relative to the agent                        1 if Left
                                                                    2 if Front
                                                                    3 if Right
                                                                    4 if Rear
Goal width             The apparent width of the goal               0 if w_goal = 0
                       w_goal relative to the agent                 1 if 1 mm ≤ w_goal < 100 mm
                                                                    2 if 101 mm ≤ w_goal < 200 mm
                                                                    3 if 201 mm ≤ w_goal < 300 mm
                                                                    4 if w_goal ≥ 300 mm
Angle from goal        The apparent angle ∠gab from the             n if (n−1)π/6 < ∠gab ≤ nπ/6,
to ball                centre of the goal to that of the            n = 1, 2, …, 12
                       ball, relative to the agent
Previous action        The previous action taken                    Move forward
                                                                    Turn left
                                                                    Turn right
                                                                    Observe
Last observe           An interval indication of the number         1 if 0 ≤ t < 5
                       of time steps t past since the last          2 if 5 ≤ t < 10
                       observe action                               3 if t ≥ 10
Random number          A number randomly generated                  {1, …, 10}

Table 8: The set of observations in the robot soccer problem
Training conditions
Each training session consists of 200 episodes, whilst each episode contains 50
iterations. An episode terminates if the agent achieves the task or if the maximum
number of iterations allowed is reached. The various parameters of the experiment are
listed below. Note that a hand-crafted policy is used to obtain the most efficient set of
parameters for this experimental purpose.
Parameter                    Description                                       Value
Exploration rate             The probability of choosing a random              ε = 1 for episodes 1–10,
                             action in an ε-greedy policy                      0.15 thereafter
Discount rate                Discounts the future rewards in the               γ = 0.7
                             return computation
Learning rate for            Determines the change ratio of the action         β_q = 0.05
action value                 values with respect to new experience
Learning rate for            Determines the change ratio of the action         β_p = 0.1
action preference            selection preferences with respect to new
                             experience
Frequency for action         The regularity of updating the action             freq_q = 2
value update                 values, expressed in number of episodes
Frequency for action         The regularity of updating the action             freq_p = 4
preference update            selection preferences, expressed in number
                             of episodes
Frequency for internal       The regularity of refining the internal           freq_s = 5
state refinement             states, expressed in number of episodes
K-S test critical region     The threshold probability (1 − p) that the        p = 0.1
                             test statistic must exceed to reject H_0

Table 9: The set of parameters used in the robot soccer problem
The technique of Shaping is used in the robot soccer problem. Two stages of learning,
with two reinforcement functions, are involved in solving the problem. The
first stage requires the agent to learn to get close to the ball. The second stage teaches
the agent to position itself to shoot a goal. The transition from stage 1 to stage 2 is
conditioned on the number of episodes passed since the U-Tree previously grew.
If the U-Tree exhibits no further development for a period of 40 episodes, stage 2
learning is initiated with the respective reinforcement function.
At the beginning of each episode, the agent observes its environment as its first
default action. The first 10 episodes are used for experience gathering. The U-Tree
begins its development at the end of the 10th episode and is checked for improvement
every 5 episodes.
3.3.2.2 Results
Both the K-S test and the IGR test versions of the U-Tree algorithm succeeded in
achieving the objective. After training, the RL agent, using the internal state tree
structure obtained, is capable of positioning itself to shoot a goal.
The sets of candidate features used for tree construction by the two algorithms are
similar. Both algorithms select the candidate feature 'current ball position' as the root
of the tree. This feature represents the location of the ball centre relative to the
panoramic view of the agent. It is logical that the agent must first find out where the
ball is at present in order to get close to it.
During tree construction, although the two algorithms choose a similar set of
candidate features, the final tree structures differ in size and order. Neither
algorithm selected the random number feature to build the tree.
The performance comparison is made on the number of training steps required to
obtain an internal state tree structure that enables the learning of an optimal policy.
Under this criterion, the K-S test U-Tree algorithm is found to be slower than the
IGR test U-Tree algorithm by 5 episodes on average.
Results from the K-S test U-Tree algorithm
The following figure of success rate demonstrates the performance of the U-Tree
algorithm, which uses the K-S test for feature extraction in solving the robot
navigation problem.
Figure 12: Robot navigation success rate per episode by K-S test
While experience is being collected for learning, from episode 1 to 10, a random
policy is used for exploration purposes, and the performance of that period is poor.
Once the construction of the U-Tree has started, the performance improves quickly.
At episode 50, the performance degrades upon the start of stage 2 learning. The
degradation can be explained by the agent attempting to accomplish
the objective of stage 2 using the tree structure learnt in stage 1. This internal
state tree structure is then refined according to the reinforcement
function given for stage 2 learning. A good policy, which provides a high success rate
in positioning the agent in a shooting position, is obtained by episode 70.
A one-way ANOVA test indicates that the variability of the results across the fifteen
trials is not significant at the 95% confidence level; the differences among the trials
are consistent with chance variation in
the robot navigation problem.
Results from the IGR test U-Tree algorithm
The figure shown below demonstrates the performance of the U-Tree algorithm with
the use of the IGR test in solving the robot navigation problem.
Figure 13: Robot navigation success rate per episode by IGR test
A random policy is used for exploration purposes from episode 1 to 10, and poor
performance results.
Once the U-Tree is in use, the performance improves quickly. The performance
degradation at episode 50 results from the transition between learning stages with
different objectives. A good policy is obtained at episode 65, providing a high success
rate in positioning the agent at a shooting position.
The variability of the results is not significant at the 95% confidence level, as shown
by the one-way ANOVA test.
Comparison
Both the feature extraction criteria, the K-S test and the IGR test, have equivalent
efficiency in constructing a U-Tree to solve the robot navigation problem. This is
illustrated in the following diagram, which shows the cumulative success rates for the
two tests, during the different phases of the robot navigation problem, over 15 trials.
Figure 14: Cumulative success rates, respective to the different stage of learning, of the K-S and
IGR test over 200 episodes in the robot navigation problem
The observation is made under the following conditions.
o The internal state refinement process of the U-Tree algorithm occurs at a
frequency of every 5 episodes starting from episode 10
o The stage 2 of learning is invoked when a U-Tree has not been refined after 40
episodes
Under these conditions, a two-way ANOVA test is used to investigate the
performance difference between the K-S test and the IGR test in terms of success rate.
The ANOVA test shows that the performance difference is insignificant at the 99%
confidence level; any difference in performance between the K-S test and the IGR
test U-Tree in the robot soccer problem is consistent with chance variation.
To further assess the two criteria, a comparison is made in terms of the duration
required to accomplish a good policy. Experiments are conducted with the internal
state refinement frequency changed as follows.
o Every 3 episodes
o Every 4 episodes
o Every 5 episodes
Since the objective of stage 1 is relatively simple, stage 2 learning is invoked
when an existing U-Tree remains unchanged for a duration of 20 episodes.
The following figures illustrate the performance of the K-S test and the IGR test with
respect to the different internal state refinement frequencies.
Figure 15: Cumulative success rates, respective to the different stage of learning, of the K-S and
IGR test over 200 episodes at the internal state refinement frequency = 5
Figure 16: Cumulative success rates, respective to the different stage of learning, of the K-S and
IGR test over 200 episodes at the internal state refinement frequency = 4
Figure 17: Cumulative success rates, respective to the different stage of learning, of the K-S and
IGR test over 200 episodes at the internal state refinement frequency = 3
Decreasing the internal state refinement interval lessens the experience collected
before each refinement of the internal state space. The figures show a slight resulting
degradation in the performance of the K-S test U-Tree. However, the two-way ANOVA
test confirms that this difference is insignificant at the 99% confidence level.
The following figures show the coefficient of variation of the K-S statistic and
of the IGR for different lengths of experience (window sizes) collected.
The coefficient of variation is the ratio of the sample standard deviation to the sample
mean. It measures the spread of a sample as a proportion of its mean, expressed as a
percentage. A large coefficient of variation indicates high variability in the sample.
The coefficient of variation can therefore be used to illustrate the variability of the K-
S statistic and that of the IGR during feature extraction. The comparison is made in
extracting the first significant feature in the robot navigation problem.
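The definition translates directly into code:

```python
from statistics import mean, stdev

def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    return 100.0 * stdev(sample) / mean(sample)
```

A sample of identical values gives 0%, and larger percentages indicate a statistic that fluctuates more relative to its typical magnitude.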
Figure 18: The coefficient of variation for the K-S statistics during the first feature extraction
process in the robot navigation problem
Figure 19: The coefficient of variation for the IGR during the first feature extraction process in the robot navigation problem
Both tests extract the current ball position correctly in this situation. However, the
K-S test exhibits a larger variance in the calculated statistic than the IGR test,
especially when experience is scarce. In other words, the IGR test U-Tree is able to
learn more stably because the IGR test displays lower variation in return prediction
than the K-S test.
3.3.3 A New York driving problem
In the New York driving problem (McCallum, 1995), a Q-Learning agent learns to
drive. To drive safely, the agent must learn when and how to change lanes to avoid
traffic.
3.3.3.1 Description of a New York driving problem
The environment of the New York driving problem is a one-way road with four lanes
of traffic. The traffic comprises the agent's vehicle and other trucks. The objective of
the agent is to avoid colliding with the trucks while making forward progress.
The agent travels at a speed of 16 meters per second. It has a visual horizon of 66
meters in front of and behind its vehicle, and it is capable of changing lanes.
There are two types of trucks, slow and fast. Slow trucks travel at 12 meters per
second and appear in front of the agent's vehicle; fast trucks travel at 20 meters per
second and appear behind it. None of the trucks can change lanes.
During the driving simulation, time is discretised at half a second per step, and at
each step the traffic positions are updated according to the vehicles' speeds. During a
time step in which the agent's vehicle changes lanes, it both shifts lanes and moves
forward its normal distance. At each time step a new truck appears in a randomly
selected lane, with both truck types equally likely to come into view.
Figure 20: A snapshot of the New York driving problem
The set of actions
The action set of the driving agent consists of 7 actions, 6 sensory actions and 1
motor action, listed in the following table.

Action | Description
Observe forward left | Look forward at the closest truck in the lane to the agent's left
Observe forward | Look forward at the closest truck in the agent's current lane
Observe forward right | Look forward at the closest truck in the lane to the agent's right
Observe backward left | Look backward at the closest truck in the lane to the agent's left
Observe backward | Look backward at the closest truck in the agent's current lane
Observe backward right | Look backward at the closest truck in the lane to the agent's right
Move to observed lane | Move to the lane previously observed (the agent remains in the same lane if it attempts to move right/left from the rightmost/leftmost lane)

Table 10: The actions of the driving agent
The agent uses these seven actions to navigate through the traffic: it tries not to run
into slow trucks in front and not to be caught by fast trucks from behind. When the
agent runs into the rear of a slow truck in its lane, it performs a 'squeeze', scraping
the side of the truck to move forward. When a fast truck catches the agent from
behind, the truck honks its horn until the agent moves out of its way.
The reward
In order to discourage the 'squeeze' and the 'honk' situations, the reinforcement
function delivers one of three possible rewards at each time step:

    r = -10    if squeeze
        -1     if honked
        +0.1   if clear progress
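The per-step reward can be expressed directly in code; a minimal sketch, assuming the three piecewise values above and that a squeeze takes precedence when both conditions hold:

```python
def driving_reward(squeeze, honked):
    """One of three rewards per time step, per the piecewise function above."""
    if squeeze:      # scraping the side of a slow truck: heaviest penalty
        return -10.0
    if honked:       # a fast truck is honking from behind
        return -1.0
    return 0.1       # clear forward progress
```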
The set of candidate features
The observation set describes what the agent observes on the road. The history index
set consists of lag ∈ {0, 1, 2, 3, 4}.

Observation | Description | Value
Agent lane | The lane of the agent | {1, 2, 3, 4}
Closest front | Distance d of the closest truck in front of the agent | 1 if 1 ≤ d < 3; 2 if 3 ≤ d < 5; 3 if 5 ≤ d ≤ 9; 4 if d > 9; 5 if clear of trucks
Closest front left | Distance of the closest truck on the front left of the agent | same bands as Closest front
Closest front right | Distance of the closest truck on the front right of the agent | same bands as Closest front
Closest rear | Distance of the closest truck behind the agent | same bands as Closest front
Closest rear left | Distance of the closest truck on the rear left of the agent | same bands as Closest front
Closest rear right | Distance of the closest truck on the rear right of the agent | same bands as Closest front
Hear horn | Being honked at | Yes, No
Previous action | The previous action taken | Observe forward left; Observe forward; Observe forward right; Observe backward left; Observe backward; Observe backward right; Move to observed lane
Random number 3 | A randomly generated number | {1, 2, 3}
Random number 5 | A randomly generated number | {1, 2, 3, 4, 5}

Table 11: The set of observations in the New York driving problem
Training conditions
Note that a hand-crafted policy was used to find an efficient set of parameters for
this experiment.

Parameter | Description | Value
Exploration rate | Probability of choosing a random action in an ε-greedy policy | ε = 1.0 if t ≤ 2000; 0.4 if 2001 ≤ t ≤ 6000; 0.2 if 6001 ≤ t ≤ 8000; 0.1 if t > 8000
Discount rate | Discounts future rewards in the return computation | γ = 0.75
Learning rate for action value | Determines how strongly the action values change with new experience | β_q = 0.05
Learning rate for action preference | Determines how strongly the action selection preferences change with new experience | β_p = 0.075
Frequency for action value update | Regularity of updating the action values, in time steps | freq_q = 100
Frequency for action preference update | Regularity of updating the action selection preferences, in time steps | freq_p = 200
Frequency for internal state refinement | Regularity of refining the internal states, in time steps | freq_s = 500
K-S test critical region | Threshold probability (1 − p) that the test statistic must exceed to reject H0 | p = 0.01

Table 12: The set of parameters used in the New York driving problem
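The exploration schedule and the ε-greedy selection it parameterises can be sketched as follows; the list-of-Q-values representation and the rng interface are illustrative assumptions, not the thesis implementation:

```python
import random

def exploration_rate(t):
    """Piecewise epsilon schedule from Table 12 (New York driving problem)."""
    if t <= 2000:
        return 1.0      # pure random exploration
    if t <= 6000:
        return 0.4
    if t <= 8000:
        return 0.2
    return 0.1

def epsilon_greedy(q_values, t, rng=random):
    """With probability epsilon pick a random action index, else the greedy one."""
    if rng.random() < exploration_rate(t):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```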
3.3.3.2 Results
In the New York driving problem, the agent explores with a random policy for the
first 2000 time steps of a 20000 time step training session. Under the random policy,
the probability of having an accident is approximately 25%.
The U-Tree construction process starts at the 2000th step, and the internal states are
refined every 500 steps thereafter. At the end of training, the policies obtained from
the U-Trees constructed under both the K-S test and the IGR test have reduced the
number of accidents (scrapes or honks) to 1/6 of those made under a random policy.
Neither policy, however, achieves completely safe driving in this problem.
Both U-Trees extract the 'current closest front' feature first. This is logical, because
scraping a slow truck incurs the heaviest penalty and is a situation that can be
trivially avoided.
Results from the K-S test U-Tree algorithm
The following figure shows the average accident counts over successive periods of
2000 time steps for a K-S test U-Tree over 15 trials. A radical reduction in accidents
occurs between the 4000th and the 7000th step.
Figure 21: Average accident counts per 2000 steps for a K-S test U-Tree
A one-way ANOVA test indicates that the variability of the results over the 15 trials
is not significant at the 95% confidence level.
Results from the IGR test U-Tree algorithm
A similar figure shows the average accident counts for an IGR test U-Tree over 15
trials. Accidents are significantly reduced between the 4000th and the 7000th step.
Figure 22: Average accident counts per 2000 steps for an IGR test U-Tree
A one-way ANOVA test shows that the variability of the results is not significant at
the 95% confidence level.
Comparison
The agent's objective is to avoid two situations: being honked at by trucks from the
rear, and scraping trucks in front. The performance of the K-S test and IGR test
U-Tree algorithms is therefore compared in terms of honk counts and scrape counts.
The figure below compares the honk counts of the two U-Tree algorithms over
successive intervals of 2000 time steps across 15 trials. An ANOVA test confirms
that the honk avoidance performance of the two U-Tree algorithms is equivalent at
the 99% confidence level (P-value = 0.031).
Figure 23: Average honk count of the K-S and IGR tests per 2000 steps
The following figure shows the scrape counts of the two U-Tree algorithms over
successive intervals of 2000 time steps across 15 trials. An ANOVA test concludes
that the difference in scrape avoidance performance between the two U-Tree
algorithms is significant at the 99% confidence level (P-value = 0.0064).
Figure 24: Average scrape count of the K-S and IGR tests per 2000 steps
The figures show that the average scrape count of the IGR test U-Tree drops
markedly from the 4000th step. This count remains lower than that of the K-S test
U-Tree until the 16000th step, when the performance of the two U-Trees becomes
indistinguishable.
This illustrates that the IGR test U-Tree is capable of extracting relevant features at
an earlier stage of training. In other words, the IGR test U-Tree requires less
experience and learns more quickly than the K-S test U-Tree, as the following
figures explain.
The figures below show the K-S statistics and the IGR observed for different lengths
of collected experience while extracting the first significant feature.
Figure 25: The K-S statistics for the first feature extraction process in the New York driving
problem
Figure 26: The IGR for the first feature extraction process in the New York driving problem
The K-S statistics plot (Figure 25) shows clearly that the K-S test is unable to
differentiate and extract a relevant feature at the time step scales in the plot. The IGR
plot (Figure 26), on the other hand, identifies 'the closest truck in front at present' as
informative for growing the U-Tree. This further strengthens the implication that the
IGR test U-Tree learns more quickly than the K-S test U-Tree.
3.3.4 An elevator control problem
In the elevator control problem, a Q-Learning agent controls three elevators to
maintain the flow of passengers in a 10-floor building. The agent aims to deliver
arriving passengers to their destinations quickly.
3.3.4.1 Description of an elevator control problem
The environment is a simulation of a 10-floor building with three elevators. Each
elevator has a maximum capacity of 10 passengers and must stop on a floor to load
and unload passengers.
Figure 27: A snapshot of the elevator control problem
The number of passengers N arriving on each floor at each time step is governed by
a Poisson distribution with probability function

    Pr(N = n) = λ^n e^(−λ) / n!

where λ is the arrival rate parameter.
The maximum number of passengers allowed in the building is 40. Each arriving
passenger's destination is equally likely to be any floor. A passenger enters an
elevator only if the elevator is travelling in the direction of the passenger's
destination, and exits when the elevator stops at that destination.
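Arrivals under the Poisson law above can be simulated by inverse-transform sampling; λ = 2.8 follows Table 16, while the capacity cap is only an assumption about how the 40-passenger building limit might be enforced:

```python
import math
import random

def sample_poisson(lam, rng=random):
    """Inverse-transform sampling of Pr(N = n) = lam**n * exp(-lam) / n!."""
    u = rng.random()
    n = 0
    p = math.exp(-lam)   # Pr(N = 0)
    cdf = p
    while u > cdf:
        n += 1
        p *= lam / n     # recurrence Pr(N = n) = Pr(N = n-1) * lam / n
        cdf += p
        if n > 10_000:   # numerical safety guard against underflow
            break
    return n

def arrivals(current_occupancy, lam=2.8, capacity=40, rng=random):
    """New arrivals this step, truncated so the building never exceeds capacity."""
    return min(sample_poisson(lam, rng), capacity - current_occupancy)
```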
The set of actions
The action set of the central elevator control is the triple product of the action set of
each elevator. The three elevators are identical; the actions available to each elevator
are described in the following table.

Action | Description
Stay | Stay on the current floor to load and unload passengers
Move up | Move up one floor
Move down | Move down one floor

Table 13: The actions of an individual elevator

Formed as the triple product of the individual elevator actions, the central control
action set comprises 27 actions. Two examples are given in the following table.

Central control action | Description
Stay, Stay, Stay | All three elevators stay on their current floors
Move up, Move down, Stay | Elevator 1 moves up one floor; elevator 2 moves down one floor; elevator 3 stays on its floor

Table 14: Examples of central control actions
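The triple-product construction of the central action set can be expressed directly, e.g. (the string labels are those of Table 13):

```python
from itertools import product

ELEVATOR_ACTIONS = ("Stay", "Move up", "Move down")

# The central control action set is the triple (Cartesian) product of the
# per-elevator action set: one component per identical elevator.
CENTRAL_ACTIONS = list(product(ELEVATOR_ACTIONS, repeat=3))
```

With three actions per elevator and three elevators, this yields 3^3 = 27 joint actions.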
The reward
The reinforcement function is proportional to the negative of the total waiting time
incurred by all passengers in the building and the elevators.
With a single passenger in the building, the reinforcement is given as follows:

    r = -t_wait     if t_wait ≤ 40
        -3 t_wait   if t_wait > 40

where t_wait is the waiting time of that passenger.
With more than one passenger, the reinforcement of the system is the sum of the
reinforcements contributed by the individual passengers.
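The per-passenger penalty and its sum over the system can be sketched as follows, assuming the piecewise values above (waiting times in simulation time steps):

```python
def passenger_reinforcement(t_wait):
    """Per-passenger reward: waits beyond 40 are penalised three times as hard."""
    return -t_wait if t_wait <= 40 else -3.0 * t_wait

def system_reinforcement(wait_times):
    """System reward: the sum of the contributions of all passengers present."""
    return sum(passenger_reinforcement(t) for t in wait_times)
```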
The set of candidate features
Only the current observation is used, because memory is not required in this task.

Observation | Description | Value
Motion 1 | Motion of elevator 1 | Up, Down, Stop
Motion 2 | Motion of elevator 2 | Up, Down, Stop
Motion 3 | Motion of elevator 3 | Up, Down, Stop
Occupancy 1 | Occupancy of elevator 1 | Empty, Not full, Full
Occupancy 2 | Occupancy of elevator 2 | Empty, Not full, Full
Occupancy 3 | Occupancy of elevator 3 | Empty, Not full, Full
Passenger out 1 | Presence of an exiting passenger on the current floor of elevator 1 | Yes, No
Passenger out 2 | Presence of an exiting passenger on the current floor of elevator 2 | Yes, No
Passenger out 3 | Presence of an exiting passenger on the current floor of elevator 3 | Yes, No
Wait above 1 | Presence of a passenger waiting on a floor above elevator 1 | Yes, No
Wait above 2 | Presence of a passenger waiting on a floor above elevator 2 | Yes, No
Wait above 3 | Presence of a passenger waiting on a floor above elevator 3 | Yes, No
Wait below 1 | Presence of a passenger waiting on a floor below elevator 1 | Yes, No
Wait below 2 | Presence of a passenger waiting on a floor below elevator 2 | Yes, No
Wait below 3 | Presence of a passenger waiting on a floor below elevator 3 | Yes, No
Long wait above 1 | Presence of a long-waiting passenger on a floor above elevator 1 | Yes, No
Long wait above 2 | Presence of a long-waiting passenger on a floor above elevator 2 | Yes, No
Long wait above 3 | Presence of a long-waiting passenger on a floor above elevator 3 | Yes, No
Long wait below 1 | Presence of a long-waiting passenger on a floor below elevator 1 | Yes, No
Long wait below 2 | Presence of a long-waiting passenger on a floor below elevator 2 | Yes, No
Long wait below 3 | Presence of a long-waiting passenger on a floor below elevator 3 | Yes, No
Long waited portion up 1 | Percentage of long-waiting passengers going up in elevator 1 | 1 if < 50%; 2 if ≥ 50%
Long waited portion up 2 | Percentage of long-waiting passengers going up in elevator 2 | 1 if < 50%; 2 if ≥ 50%
Long waited portion up 3 | Percentage of long-waiting passengers going up in elevator 3 | 1 if < 50%; 2 if ≥ 50%
Random number 3 | A randomly generated number | {1, 2, 3}
Random number 5 | A randomly generated number | {1, 2, 3, 4, 5}

Table 15: The set of observations in the elevator control problem
Training conditions
Note that a hand-crafted policy was used to find an efficient set of parameters for
this experiment.

Parameter | Description | Value
Exploration rate | Probability of choosing a random action in an ε-greedy policy | ε = 1.0 if t ≤ 5000; 0.4 if 5001 ≤ t ≤ 10000; 0.2 if 10001 ≤ t ≤ 15000; 0.1 if t > 15000
Discount rate | Discounts future rewards in the return computation | γ = 0.7
Learning rate for action value | Determines how strongly the action values change with new experience | β_q = 0.05
Learning rate for action preference | Determines how strongly the action selection preferences change with new experience | β_p = 0.05
Frequency for action value update | Regularity of updating the action values, in time steps | freq_q = 150
Frequency for action preference update | Regularity of updating the action selection preferences, in time steps | freq_p = 250
Frequency for internal state refinement | Regularity of refining the internal states, in time steps | freq_s = 500
K-S test critical region | Threshold probability (1 − p) that the test statistic must exceed to reject H0 | p = 0.01
Poisson mean | Passenger arrival rate parameter | λ = 2.8

Table 16: The set of parameters used in the elevator control problem
During the internal state refinement phase, the most relevant feature is extracted to
refine an internal state. Unlike the previous two experiments (the robot navigation
and New York driving problems), in the elevator control problem the refinement
process does not stop after the most relevant feature has been extracted: the new
internal states resulting from the extracted feature are themselves examined for
refinement, until no further partition is possible.
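This recursive split-then-re-examine loop can be sketched as follows; `select_feature` stands in for the K-S or IGR leaf expansion criterion, and its interface (return the winning feature, or None when no split qualifies) is an assumption of this sketch, as is representing instances as dicts of feature values:

```python
def refine(leaf_instances, candidate_features, select_feature):
    """Recursively split a leaf by its most relevant feature until no split qualifies.

    Returns a nested dict: internal nodes carry "feature" and "children",
    leaves carry the remaining "instances".
    """
    feature = select_feature(leaf_instances, candidate_features)
    if feature is None:
        return {"instances": leaf_instances}          # no qualifying split: leaf
    groups = {}
    for inst in leaf_instances:                       # partition by feature value
        groups.setdefault(inst[feature], []).append(inst)
    return {
        "feature": feature,
        "children": {value: refine(group, candidate_features, select_feature)
                     for value, group in groups.items()},
    }
```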
3.3.4.2 Results
In the elevator control problem, the agent explores with a random policy for the first
5000 time steps of a 20000 time step training session. Under the random policy, the
average waiting time of a passenger is approximately 25 seconds.
The U-Tree construction process starts at the 5000th step, and the internal states are
refined every 500 steps thereafter. At the end of training, the policies obtained from
the U-Trees constructed under both the K-S test and the IGR test have reduced the
average waiting time to approximately 5 seconds, an improvement of 80% over the
random policy.
Results from the K-S test U-Tree algorithm
Figure 28 shows the average waiting time per passenger under the K-S test U-Tree in
the elevator control problem. When U-Tree construction starts, the average waiting
time initially drops by approximately 8 seconds, with gradual reduction continuing
until the end of training. Note that the step-wise improvement in performance
reflects the time taken for a relevant feature to be extracted into the tree-structured
policy.
Figure 28: Average waiting time of passengers under the K-S test U-Tree over 15 trials
Results from the IGR test U-Tree algorithm
A similar figure shows the average waiting time per passenger under the IGR test
U-Tree in the elevator control problem. Upon U-Tree construction, the average
waiting time initially drops by approximately 12 seconds, with gradual reduction
following until the end of training. Again, the step-wise improvement in performance
reflects the time taken for a relevant feature to be extracted into the tree-structured
policy.
Figure 29: Average waiting time of passengers under the IGR test U-Tree over 15 trials
Comparison
Both algorithms provide a tree-structured policy that leads to step-wise improvement.
Although a two-way ANOVA test concludes that there is no difference in
performance between the K-S test and IGR test U-Trees at the 99% confidence level,
the IGR test extracts the more relevant features earlier in training. This indicates that
the IGR test U-Tree can learn more rapidly than the K-S test U-Tree.
4 Conclusion
In this thesis, automatic state construction problems have been investigated to
illustrate the practical value of automatic state construction in real RL applications.
A decision tree technique is applied to state construction, resulting in a
tree-structured (internal) state representation and policy.
Automatic state construction allows an RL agent to progressively learn and refine its
own state representation when the raw input does not adequately capture the state of
the environment. During state construction, only task-relevant observations are
extracted, which results in a simpler and more compact state representation than one
defined over all possible combinations of observations. A large state space causes
scalability problems for RL algorithms, and automatic state construction helps to
reduce the occurrence of such problems.
Task-relevant observations include observations made in the past. When a past
observation is extracted for state construction, a form of short-term memory is
introduced into the behaviour of the agent. The policy of the agent is then reactive to
states defined by relevant observations from both the present and the past.
The U-Tree algorithm (McCallum, 1995) is an RL method that uses a decision tree
technique for state construction. Its leaf expansion criterion is the K-S test, which
evaluates the difference in return distributions when a feature is introduced at a leaf.
We presented a variant of the U-Tree that uses the IGR test as the leaf expansion
criterion; the IGR test measures the disparity of the returns as classified by a feature
at a leaf.
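The two leaf expansion criteria can be sketched with stdlib code; this is an illustrative reimplementation on discretised returns, not the thesis code, and the U-Tree's instance bookkeeping is omitted:

```python
import math
from collections import Counter

def ks_statistic(a, b):
    """Two-sample K-S statistic: the largest gap between the empirical CDFs."""
    def cdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    points = sorted(set(a) | set(b))
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain_ratio(returns, feature_values):
    """Information gain of splitting `returns` by a feature, normalised by the
    split entropy (Quinlan's gain ratio)."""
    n = len(returns)
    groups = {}
    for r, f in zip(returns, feature_values):
        groups.setdefault(f, []).append(r)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0
    return (entropy(returns) - remainder) / split_info
```

A feature whose introduction yields a large K-S statistic (or a large gain ratio) separates the return distribution well and is a candidate for expanding the leaf.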
In discrete domains, the state construction functionality of both the K-S test and
IGR test U-Trees is experimentally demonstrated. The major advantage of our IGR
test U-Tree is that it produces a more robust state representation and enables faster
learning. This better performance can be explained by the fact that the IGR exhibits
lower variability in return prediction than the K-S test used in the original U-Tree.
Future research can be oriented toward generalizing the U-Tree approach to
continuous domains.
5 Contributions
In my master's project, the application of a decision tree technique to automatic state
construction is investigated. This involves the implementation of the simulations and
of the tree structure. The U-Tree framework is studied, and the IGR test is applied to
produce a variant of the U-Tree. Three sets of experiments were conducted to
compare the original U-Tree and the new variant. Performance was evaluated and
the results were accepted at the CIMCA 2004 conference.
6 References
Abe, N., & Nakamura, A. (1999). Learning to optimally schedule Internet banner advertisements. ICML 1999, 12-21.
Aberdeen, D. (2002). A survey of approximate methods for solving POMDPs. RSISE, Australian National University.
Aberdeen, D., & Baxter, J. (2002). Internal state policy gradient algorithms for infinite horizon POMDPs (Technical Report). RSISE, Australian National University. http://discus.anu.edu.au/~daa/papers.html.
Astrom, K.J. (1965). Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications, 10.
Baird, L.C., & Moore, A.W. (1999). Gradient descent for general reinforcement learning. Advances in Neural Information Processing Systems 11. MIT Press. http://www.cs.cmu.edu/~eduleemon/paper/index.html.
Bakker, B. (2001). Reinforcement learning with LSTM in non-Markovian tasks with long term dependencies (Technical Report). Leiden University. http://fsw.leidenuniv.nl/www/w3_func/bbaker/abstracts.htm.
Bauer, H.-U., & Pawelzik, K. (1992). Quantifying the neighbourhood preservation of self-organising maps. IEEE Transactions on Neural Networks, 3(4), 570-579.
Bauer, H.-U., & Villmann, Th. (1995). Growing a hypercubical output space in a self-organising feature map. Technical Report TR-95-030, ICSI, Berkeley, July.
Bellman, R. (1957). Dynamic Programming. Princeton, N.J.: Princeton University Press.
Blythe, J. (1999). Decision-theoretic planning. AI Magazine, 1. http://www.isi.edu/~blythe/papers/asmag.html.
Cassandra, A. (1998). Exact and approximate algorithms for partially observable Markov decision processes. Doctoral dissertation, Brown University.
Cassandra, A.R. (1999). POMDPs for dummies: POMDPs and their algorithms, sans formula. http://www.cs.brown.edu/research/ai/pomdp/tutorial/iindex.html.
Cassandra, A.R., Kaelbling, L.P., & Littman, M.L. (1994). Acting optimally in partially observable stochastic domains. Proceedings of the Twelfth National Conference on Artificial Intelligence. Seattle, WA.
Crites, R.H., & Barto, A.G. (1996). Improving elevator performance using reinforcement learning. In D.S. Touretzky, M.C. Mozer, & M.E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1017-1023. MIT Press, Cambridge, MA.
Dietterich, T.G. (2000). An overview of MAXQ hierarchical reinforcement learning. SARA, pp. 26-44.
Dietterich & Wang (2002). Batch value function approximation via support vectors. In T.G. Dietterich, S. Becker, & Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA.
Dutech, A. (1998). Solving POMDPs using selected past events. Proceedings of the 14th European Conference on Artificial Intelligence.
Gullapalli, V. (1992). A comparison of supervised and reinforcement learning methods on a reinforcement learning task. Computer and Information Science Department, University of Massachusetts.
Hansen, E.A. (1998). Solving POMDPs by searching in policy space. Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 211-219. Madison, WI.
Hauskrecht, M. (2000). Value function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13, 33-94.
Hernandez-Gardiol, N., & Mahadevan, S. (2000). Hierarchical memory-based reinforcement learning. Proceedings of Neural Information Processing Systems, 2001.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735-1780.
Holland, J.H. (1992). Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor.
Kaelbling, L.P., Littman, M.L., & Moore, A.W. (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 237-285.
Lanzi, P.L. (2000). Adaptive agents with reinforcement learning and internal memory. Sixth International Conference on the Simulation of Adaptive Behavior (SAB2000).
Lin, L.J., & Mitchell, T.M. (1992). Memory approaches to reinforcement learning in non-Markovian domains (Technical Report CS-92-138). Carnegie Mellon University, Pittsburgh, PA.
Littman, M.L., & Sun, R. (2001). Value-function reinforcement learning in Markov games. Journal of Cognitive Systems Research, 2, 55-66.
Lovejoy, W.S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28, 47-65.
McCallum, A.K. (1995). Learning to use selective attention and short-term memory in sequential tasks. Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, MIT Press, pp. 315-324.
McCallum, A.K. (1996). Reinforcement learning with selective perception and hidden state. Doctoral dissertation, University of Rochester.
Meuleau, N., Peshkin, L., Kim, K.E., & Kaelbling, L.P. (1999). Learning finite state controllers for partially observable environments. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. KTH, Sweden: Morgan Kaufmann.
Meuleau, N., Kim, K.E., Kaelbling, L.P., & Cassandra, A.R. (1999). Learning finite state controllers for partially observable environments.
Mitchell, T.M. (1997). Machine Learning. New York: McGraw-Hill.
Murphy, K.P. (2000). A survey of POMDP solution techniques (Technical Report). Dept. of Computer Science, U.C. Berkeley.
Ng, A.Y., & Jordan, M. (1999). PEGASUS: A policy search method for large MDPs and POMDPs. In C. Boutilier & M. Goldszmidt (eds.), Proceedings of the 16th Conference on Machine Learning, pp. 278-287. Morgan Kaufmann, San Francisco, CA.
Peshkin, L.M. (2000). Thesis proposal: Architecture for policy search.
Puterman, M.L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Schmidhuber, J.H. (1991). Reinforcement learning in Markovian and non-Markovian environments. Advances in Neural Information Processing Systems, pp. 500-506. Morgan Kaufmann Publishers, Inc.
Schneider (1999). Distributed value functions. Proceedings of the 16th International Conference on Machine Learning.
Selfridge, O.J., Sutton, R.S., & Barto, A.G. (1985). Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110, 568-582.
Shelton, C.R. (2001). Policy improvement for POMDPs using normalized importance sampling (Technical Report AI Memo 2001-002). MIT, Cambridge, MA.
Shapiro, D., Langley, P., & Shachter, R. (2000). Using background knowledge to speed reinforcement learning in physical agents. Proceedings of the Fifth International Conference on Autonomous Agents.
Singh, S.P., Jaakkola, T., & Jordan, M.I. (1995). Reinforcement learning with soft state aggregation. Advances in Neural Information Processing Systems, pp. 361-368. The MIT Press.
Strens, M.J.A., & Moore, A.W. (2002). Policy search using paired comparisons. Journal of Machine Learning Research, 3, 921-950.
Sutton, R.S., & Barto, A.G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Uther, W.T.B., & Veloso, M.M. (1998). Tree based discretization for continuous state space reinforcement learning. Proceedings of AAAI-98, Madison, WI.
Watkins, C.J.C.H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.
Wang & Dietterich (2002). Stabilizing value function approximation with the BFBP algorithm. In T.G. Dietterich, S. Becker, & Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA.
Williams, J.K., & Singh, S. (1998). Experimental results on learning stochastic memoryless policies for partially observable Markov decision processes. Advances in Neural Information Processing Systems.
Zhang, W., & Dietterich, T.G. (1996). High-performance job-shop scheduling with a time-delay TD(λ) network. In D.S. Touretzky, M.C. Mozer, & M.E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1024-1030. MIT Press, Cambridge, MA.