Title: Automatic State Construction using Decision Trees for Reinforcement Learning Agents
Manix Au
F.I.T. / C.I.T.I.
2004
Master in I.T. (Research)
The Statement of original authorship
The work contained in this thesis has not been previously submitted for a degree or
diploma at any higher education institution. To the best of my knowledge and belief, the
thesis contains no material previously published or written by another except where due
reference is made.
Signature:
Date:
Key words
Reinforcement learning
State
Action
Reward
Policy
Value based method
Policy search method
Direct gradient method
Automatic state construction
Decision tree
Partial observability
U-Tree
Kolmogorov-Smirnov two sample test
Information gain ratio test
Table of Contents

ABSTRACT 6
ACKNOWLEDGEMENT 7
INTRODUCTION TO THE THESIS 8
  REINFORCEMENT LEARNING 8
  SCOPE OF THE STUDY 8
  OVERVIEW 9
  ORIGINAL CONTRIBUTIONS 10
1 INTRODUCTION TO REINFORCEMENT LEARNING 11
  CHAPTER SUMMARY 11
  1.1 WHAT IS REINFORCEMENT LEARNING? 11
  1.2 THE ARCHITECTURE OF A RL SYSTEM 12
    1.2.1 The agent and the environment 12
    1.2.2 Overview of the agent-environment framework 13
    1.2.3 The learning structures of a typical RL problem 13
  1.3 DIFFERENT TYPES OF THE ENVIRONMENT 14
    1.3.1 The nature of the state space of the environment 14
    1.3.2 The observability of the environment 14
    1.3.3 The availability of a model of the environment 15
  1.4 THE LEARNING ASPECT 16
    1.4.1 The policy, the return and the value function 16
    1.4.2 The key to learning 18
    1.4.3 An example: A finite grid environment 21
    1.4.4 Relationship between the optimal value functions and the optimal policy 23
    1.4.5 The different techniques in solving RL problems 24
  1.5 ANCILLARY ISSUES 35
    1.5.1 Exploration-exploitation dilemma 35
    1.5.2 Temporal credit assignment 35
    1.5.3 Shaping 36
  1.6 RL APPLICATIONS 37
    1.6.1 Cellular communication system 37
    1.6.2 Others 37
  1.7 RESEARCH CHALLENGES IN RL 38
    1.7.1 Scaling up to large problems 38
    1.7.2 Partial observability of the environment 39
  1.8 AUTOMATIC STATE CONSTRUCTION 40
2 U-TREE: A RL ALGORITHM WITH AUTOMATIC STATE CONSTRUCTION FUNCTIONALITY 41
  CHAPTER SUMMARY 41
  2.1 INTRODUCTION TO THE U-TREE ALGORITHM 41
    2.1.1 Problem domain targeted by the U-Tree 42
    2.1.2 The architecture of a U-Tree 42
    2.1.3 Feature extraction 46
  2.2 CONSTRUCTION OF A U-TREE 49
    2.2.1 The U-Tree learning algorithm 49
    2.2.2 Using the tree 50
    2.2.3 Improving the tree 50
  2.3 SHORTCOMINGS 52
    2.3.1 Circular dependency 52
    2.3.2 Pre-selection of the candidate feature 52
3 PROPOSED METHOD AND RESULTS 53
  CHAPTER SUMMARY 53
  3.1 INTRODUCTION TO THE VARIANT OF THE U-TREE METHOD 53
    3.1.1 Information Gain Ratio test 53
    3.1.2 Procedure of the IGR test 54
  3.2 CONSTRUCTION OF A U-TREE BY IGR TEST 55
  3.3 EXPERIMENTAL RESULTS 56
    3.3.1 Using the Analysis of Variance statistical test 57
    3.3.2 A robot soccer problem 59
    3.3.3 A New York driving problem 71
    3.3.4 An elevator control problem 80
4 CONCLUSION 86
5 CONTRIBUTIONS 87
6 REFERENCES 88
Illustrations and diagrams

Figure 1: A finite grid problem 21
Figure 2: Value function under random policy of a finite grid problem 21
Figure 3: Value function under optimal policy of a finite grid problem 22
Figure 4: Optimal policy diagram of a finite grid problem 22
Figure 5: Example of return realization in training history 32
Figure 6: State transition dynamic under POMDP 39
Figure 7: A decision tree for playing tennis according to the sky outlook 43
Figure 8: A U-Tree for a soccer agent to get to a ball 44
Figure 9: Partition of a state including returns during leaf expansion 46
Figure 10: Probability density function of the χ² distribution 48
Figure 11: A snapshot of the robot navigation problem 59
Figure 12: Robot navigation success rate per episode by K-S test 64
Figure 13: Robot navigation success rate per episode by IGR test 65
Figure 14: Cumulative success rates, at the different stages of learning, of the K-S and IGR tests over 200 episodes in the robot navigation problem 66
Figure 15: Cumulative success rates, at the different stages of learning, of the K-S and IGR tests over 200 episodes at the internal state refinement frequency = 5 67
Figure 16: Cumulative success rates, at the different stages of learning, of the K-S and IGR tests over 200 episodes at the internal state refinement frequency = 4 68
Figure 17: Cumulative success rates, at the different stages of learning, of the K-S and IGR tests over 200 episodes at the internal state refinement frequency = 3 68
Figure 18: The coefficient of variation of the K-S statistics during the first feature extraction process in the robot navigation problem 69
Figure 19: The coefficient of variation of the IGR during the first feature extraction process in the robot navigation problem 70
Figure 20: A snapshot of the New York driving problem 71
Figure 21: Average accident counts per 2000 steps for a K-S test U-Tree 75
Figure 22: Average accident counts per 2000 steps for an IGR test U-Tree 76
Figure 23: Average honk count of the K-S and IGR tests per 2000 steps 77
Figure 24: Average scrap count of the K-S and IGR tests per 2000 steps 77
Figure 25: The K-S statistics for the first feature extraction process in the New York driving problem 78
Figure 26: The IGR for the first feature extraction process in the New York driving problem 78
Figure 27: A snapshot of the elevator control problem 80
Figure 28: Average waiting time of passengers under the K-S test U-Tree over 15 trials 84
Figure 29: Average waiting time of passengers under the IGR test U-Tree over 15 trials 85
Tables

Table 1: Effect of discount factor on return 17
Table 2: An example of a history table before update 44
Table 3: An example of a history table after update 45
Table 4: Data table for a one-way ANOVA test 57
Table 5: A U-Tree performance data table for a one-way ANOVA test 57
Table 6: Data table for a two-way ANOVA test 58
Table 7: The actions of the soccer agent 60
Table 8: The set of observations in the robot soccer problem 61
Table 9: The set of parameters used in the robot soccer problem 62
Table 10: The actions of the driving agent 72
Table 11: The set of observations in the New York driving problem 73
Table 12: The set of parameters used in the New York driving problem 74
Table 13: The actions of an individual elevator 81
Table 14: Examples of the central control action 81
Table 15: The set of observations in the elevator control problem 82
Table 16: The set of parameters used in the elevator control problem 83
Equations

Equation 1: Markov property 18
Equation 2: Markov assumption on the state transition and reward 18
Equation 3: Action value function 19
Equation 4: State value function 19
Equation 5: Optimal state value function 20
Equation 6: Optimal action value function 20
Equation 7: Policy update equation by minimizing the Bellman residual for value iteration 25
Equation 8: Policy update equation by minimizing the residual gradient for value iteration 26
Equation 9: Policy update equation by minimizing the Bellman residual for Q-learning 26
Equation 10: Policy update equation by minimizing the residual gradient for Q-learning 27
Equation 11: The gradient of Pr(h_t) 31
Equation 12: A discounted sum of rewards 35
Equation 13: A long term average reward 35
Abstract
Reinforcement Learning (RL) is a learning framework in which an agent learns a
policy from continual interaction with the environment. A policy is a mapping from
states to actions. The agent receives rewards as feedback on the actions performed.
The objective of RL is to design autonomous agents to search for the policy that
maximizes the expectation of the cumulative reward.
When the environment is partially observable, the agent cannot determine the states
with certainty. These states are called hidden in the literature. An agent that relies
exclusively on its current observations will not always find the optimal policy. For
example, a mobile robot moving down a corridor of identical doors needs to remember
the number of doors it has gone past in order to reach a specific one.
To overcome the problem of partial observability, an agent uses both current and past
(memory) observations to construct an internal state representation, which is treated
as an abstraction of the environment.
This research focuses on how features of past events are extracted with variable
granularity regarding the internal state construction. The project introduces a new
method that applies Information Theory and decision tree technique to derive a tree
structure, which represents the state and the policy. The relevance of a candidate
feature is assessed by its Information Gain Ratio ranking with respect to the
cumulative expected reward.
Experiments carried out on three different RL tasks have shown that our variant of the
U-Tree (McCallum, 1995) produces a more robust state representation and faster
learning. This better performance can be explained by the fact that the Information
Gain Ratio exhibits a lower variance in return prediction than the Kolmogorov-
Smirnov statistical test used in the original U-Tree algorithm.
Acknowledgement
My deepest gratitude goes to my supervisor, Frederic Maire. Over the past years, my
demands on his time and patience for discussion, feedback and proofreading could at
best be described as unreasonable.
Without his constant encouragement (and harassment) and his expertise, I am sure
that this thesis would never have reached completion. In several cases, important
ideas would have fallen by the wayside without Frederic’s insight to point out the
interest.
I also thank C.I.T.I. for funding my study. Without this, not only would I not have had
the freedom to pursue my own research, I would never have had the opportunity to
perform research at all.
Manix Au
04.03.2004
Introduction To The Thesis
Reinforcement Learning
Reinforcement Learning (RL) is a computational approach to automating goal-
directed learning and decision-making. It is a problem description rather than a
specific method (Sutton & Barto, 1998).
An RL agent learns how to map situations (states) to actions to achieve some given
tasks. The agent is not informed which actions to take, but instead must discover the
actions that provide long-term benefits, by trial and error.
The trial-and-error solution search characteristic gives RL significant practical value.
In many complex non-linear control problems, great difficulty arises in determining
the behaviour strategy of a controlling agent (Blythe, 1999). Moreover, it is
impractical to obtain examples of desired behaviour that are correct and representative
of all the situations in which the agent has to act. Using RL techniques, the agent
learns from its own experience to solve the problem autonomously through successive
interactions, without the need of supervision (Gullapalli, 1992).
Scope of the study
This thesis addresses an important research aspect of Reinforcement Learning. The
aim is to document the development of a new technique for the automated
construction of a goal-directed representation of system states in a decision tree.
The application of the system in Reinforcement Learning is demonstrated in three
problems where the goal of the system is to find a solution to a control problem
through experimentation. The new method is a variation on the existing U-Tree
method (McCallum, 1995) and the difference is in the selection criteria that are used
in tree construction. The proposed Information Gain Ratio test is compared and
contrasted with the Kolmogorov-Smirnov (K-S) test.
Overview
The thesis is structured in four chapters.
Chapter 1 presents an introduction to Reinforcement Learning (RL). Section 1.1
establishes the fundamental concepts in RL. In section 1.2, the components and the
learning structure of a typical RL system are explained. Section 1.3 describes the
different types of environments. Section 1.4 introduces the learning aspect of RL and
provides an overview of the different solution methods. The ancillary issues and some
successful examples of RL application are described in sections 1.5 and 1.6
respectively. Section 1.7 looks at the research challenges in RL while section 1.8
introduces automatic state construction as a strategy to solve the problems.
Chapter 2 presents the U-Tree method (McCallum, 1995), which is a RL algorithm
with automatic state construction functionality. The U-Tree method allows an agent to
construct and refine a tree structured representation of its internal state and policy by
extracting relevant features from the RL system. These features are composed of
present and past observations of the environment or events. Hence, the policy derived
is reactive to both the current and past observations. This incorporates a form of
short-term memory into the behaviour.
In section 2.1, the U-Tree algorithm is explained in terms of how decision tree
technique can be applied in internal state construction. Details of the Kolmogorov-
Smirnov Two Sample test, which is the selection criterion used for tree expansion and
internal state refinement are discussed.
Section 2.2 provides the pseudo code for the U-Tree algorithm in terms of how
learning can be facilitated by a U-Tree and how a U-Tree grows with experience.
Section 2.3 discusses the limitations and shortcomings of the U-Tree algorithm.
Chapter 3 presents a variation of the U-Tree algorithm (McCallum, 1995). A new
feature extraction criterion, the Information Gain Ratio (IGR) test, is introduced to the
U-Tree framework for internal state construction.
Section 3.1 introduces the variant of the U-Tree by beginning with a description of the
original U-Tree. This is followed by a section that discusses the technical background
of Information Theory and how it can be applied to give a new RL algorithm.
Section 3.2 provides the pseudo code for the IGR test U-Tree algorithm in terms of
the internal state refinement process.
A set of three experiments is conducted to assess the IGR test with respect to the K-S
test in growing a U-Tree. In section 3.3, the ANOVA test, which is used to compare
performance between the two U-Tree algorithms, is discussed. Then, the three
problem domains are described. These problem domains include robot navigation,
driving, and elevator control. Each problem domain subsection provides the
description of the environment, the task, the action set, the reward, the candidate
feature set, the training conditions, results and comparisons.
Chapter 4 concludes the thesis and lays a foundation for future work.
Original Contributions
o Introductory guide to Direct Gradient methods
o The application of decision tree technique in automatic state construction
o The implementation of the simulations and the tree structure
o Apply the IGR test to the U-Tree framework to produce a variant of the U-Tree
for comparison in three sets of experiments
o Performance was evaluated and the results were accepted at the CIMCA 2004
conference
1 Introduction to Reinforcement Learning
Chapter summary
Chapter 1 presents an introduction to Reinforcement Learning (RL). Section 1.1
establishes the fundamental concepts in RL. In section 1.2, the components and the
learning structure of a typical RL system are explained. Section 1.3 describes the
different types of environments. Section 1.4 introduces the learning aspect of RL and
provides an overview of the different solution methods. The ancillary issues and some
successful examples of RL application are described in sections 1.5 and 1.6
respectively. Section 1.7 looks at the research challenges in RL while section 1.8
introduces automatic state construction as a strategy to solve the problems.
1.1 What is Reinforcement Learning?
Reinforcement Learning (RL) is a computational approach to automating goal-
directed learning and decision-making. It is a problem description rather than a
specific method (Sutton & Barto, 1998).
An RL agent learns how to map situations (states) to actions to achieve some given
tasks. The agent is not informed which actions to take, but instead must discover the
actions that provide long-term benefits, by trial and error. For example, a chess
playing agent must be able to determine which moves have been critical to the
outcome and then alter its strategy accordingly.
The trial-and-error solution search characteristic gives RL significant practical value.
In many complex non-linear control problems, great difficulty arises in determining
the behaviour strategy of a controlling agent (Blythe, 1999). Moreover, it is
impractical to obtain examples of desired behaviour that are correct and representative
of all the situations in which the agent has to act. Using RL techniques, the agent
learns from its own experience to solve the problem autonomously through successive
interactions, without the need of supervision (Gullapalli, 1992).
1.2 The architecture of a RL system
A RL system is composed of an agent and its environment.
1.2.1 The agent and the environment
In a RL system, a learning agent is embedded in an environment. The environment
consists of everything outside of the agent. It is what the agent can perceive and act
on. In the RL framework, the agent-environment interaction is abstracted in a triplet
of (state, action, reward).
• A state represents a particular snapshot of the environment or a situation. It
becomes the basis for the agent to select an action to carry out.
• An action represents a decision made by the agent depending on the state of
the environment.
• A reward is a scalar feedback from the environment. It quantifies how good a
performed action is.
The following table illustrates some exemplary control problems with the states, the
actions and the reward defined.
Task: A chess game
  State: All the possible combinations of the chess board
  Action: The legal moves
  Reward: +1 when the game is won; -1 otherwise

Task: Cart pole balancing
  State: The deviation angle and angular velocity of the pole, the velocity of the cart and the distance of the cart from the edge
  Action: The magnitude and direction of the force applied to the cart
  Reward: -1 when the pole falls or when the cart goes over the edge, otherwise 0

Task: Motion control of a robotic arm in a repetitive pick-and-place task
  State: The object, the joint angles and velocities
  Action: The amount of voltage applied to the motors in each joint
  Reward: +1 if each object is successfully placed, otherwise 0

Task: Object avoidance behaviour of a mobile robot
  State: The positions of the objects in the panoramic view of the robot
  Action: Travelling forward, left, right and rear
  Reward: -1 if the agent bumps into an object, otherwise 0

Table 1: Examples of RL tasks
1.2.2 Overview of the agent-environment framework
Diagrammatically, the agent-environment interaction can be represented as follows.
Figure 1: Agent-environment interaction framework
The agent and the environment interact continually in a sequence of time steps. At an
arbitrary time step t, the agent observes the environment's state s_t ∈ S, where S
represents the set of states of the environment. Depending on the state, the agent
selects an action a_t ∈ A(s_t), where A(s_t) represents the set of actions available in
state s_t. The agent then receives a numeric reward r_t as a consequence of the action
a_t performed, and the environment changes to a new state s_{t+1}.
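This interaction loop can be sketched in a few lines of Python. The coin-guessing environment and the random agent below are invented purely for illustration (they are not one of the thesis problems); any environment exposing reset() and step(), and any agent exposing act() and learn(), would fit the same frame.

```python
import random

class CoinEnv:
    """Toy environment: guess a coin flip; reward +1 if correct, -1 otherwise."""
    def reset(self):
        self.coin = random.choice(["heads", "tails"])
        return "start"                      # a single, uninformative state
    def step(self, action):
        reward = 1 if action == self.coin else -1
        next_state = self.reset()           # episodes of length one; restart
        return next_state, reward

class RandomAgent:
    def act(self, state):
        return random.choice(["heads", "tails"])
    def learn(self, state, action, reward, next_state):
        pass                                # a learning agent would update here

env, agent = CoinEnv(), RandomAgent()
state = env.reset()
total = 0
for t in range(1000):                       # the sequence of time steps
    action = agent.act(state)               # agent selects a_t given s_t
    next_state, reward = env.step(action)   # environment returns r_t and s_{t+1}
    agent.learn(state, action, reward, next_state)
    total += reward
    state = next_state
```

A learning agent would replace the empty learn() with an update of its policy from the (s_t, a_t, r_t, s_{t+1}) experience tuple.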
1.2.3 The learning structures of a typical RL problem
The key learning structures of a RL problem are the policy, the reinforcement function
and the value function.
• A policy π_t is part of the agent. It is defined as a mapping from the states to
the actions.
• A reinforcement function is designed to provide a reward to the agent, on the
action performed in a particular state at each time step. The aim is to implicitly
define the desired behaviour of the agent in terms of rewards so that a good
policy can be learnt through trial and error.
• Value functions provide a measure of the goodness of the states or the
goodness of the state and action pairs. The mathematical definition of the
value function forms the basis for the development of efficient value-based RL
algorithms.
1.3 Different types of the environment
Different types of environments require more domain specific approaches. The
environments can be categorized according to the nature of the state space, the
observability and the availability of a model of the environment.
1.3.1 The nature of the state space of the environment
In simple RL systems, the state space of the environment is discrete. For example, the
state space in a tic-tac-toe game is represented by all the possible combinations of
circles and crosses on the board. In practice, many RL systems have a continuous state
space. In the pole balancing problem, the state space is formed by the cross product
of some features of the environment. These features are the angle of the pole, the
angular velocity of the pole, the position of the cart, and the velocity and/or the
acceleration of the cart.
An environment contains many features; however, some are irrelevant. For example,
the position, the colour and the size of the circles and crosses are features of the
environment in a tic-tac-toe game. A logical state space can be defined using the
combination of circles and crosses on the board, regardless of their colours or sizes.
Only the positions of the circles and the crosses are of concern for state definition.
1.3.2 The observability of the environment
When an environment is termed fully observable, all the features are well-perceived
by the agent. The agent can use its perception to comprehend the environment’s state.
For example, a chess-playing agent needs to know the positions of all the chess pieces
to define the state so as to make a move.
The environment is said to be partially observable when some features of the
environment are not perceivable. Partial observability can be caused by various
reasons, such as limitations of the sensors, noise and occlusions. Under these
circumstances, an agent is unable to disambiguate amongst the different states. The
indistinguishable states are called hidden states. For instance, a driver agent learns to
navigate through traffic with the state space defined by the positions of the vehicles in
the agent's current view. To pass a vehicle that is close in front, the agent needs to
check the blind spot for clearance to overtake. In this situation, an accident could
happen because the agent cannot perceive both views at the same time.
1.3.3 The availability of a model of the environment
When a model of the environment is provided, the transition dynamics of the
environment is known to the agent. During the agent-environment interaction, the
agent performs an action, which causes the environment to transit into the next state.
Deterministic state transition implies that the state transition probability from a state
to another, given an action, is unity. Stochastic state transition means that, given an
action, a state may transition to a number of possible states, each with an associated
probability. When the model of the environment is given, the state transition
probabilities from the current state to all the possible future states are known.
1.4 The learning aspect
RL methods originated from two disciplines, Dynamic Programming (DP) and
supervised learning. DP is a field of Mathematics, which has traditionally been used
to solve problems of optimization and control. Nevertheless, traditional DP is limited
in the size and the complexity of the problem it can address. Supervised learning is a
general method for training a parameterized function approximator to represent
learning functions. It requires sample pairs of input and output from the function that
is to be learnt. Unfortunately, sample outputs cannot be easily obtained in practice.
The following sections explain the classic RL methods, which apply the control
algorithms developed in DP to learn the value of actions and hence to facilitate control.
Section 1.4.5.2 introduces a modern RL approach, the Policy Search method, which
learns control directly through some gradient function.
1.4.1 The policy, the return and the value function
A policy π_t represents a probability mapping from states to actions:
π_t(s, a) = Pr(a_t = a | s_t = s). A RL agent learns to select a good action depending on
the state, and RL methods specify how the agent changes its policy as a result of its
experience.
The reward r_t represents the immediate feedback for the action a_t chosen given the
state s_t. To choose an action that is beneficial in the long term, the agent tries to
maximize a quantity called the return R_t.
The return R_t is the sum of discounted rewards from the current state proceeding to
the terminal state. Formally, the return is defined by

    R_t = Σ_{n=t}^{T} γ^{n-t} r_n

where γ, 0 ≤ γ ≤ 1, is a discount factor and T, T ≥ t, is the time when the terminal
state is reached.
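As a concrete check of this definition, the short Python sketch below computes R_t for t = 0 over an invented three-step reward sequence:

```python
def discounted_return(rewards, gamma):
    """R_t = sum over n from t to T of gamma^(n-t) * r_n, here for t = 0."""
    return sum(gamma**n * r for n, r in enumerate(rewards))

rewards = [0, 0, 1]                       # reward only on the final step
print(discounted_return(rewards, 1.0))    # → 1.0 (undiscounted sum)
print(discounted_return(rewards, 0.5))    # → 0.25 (final reward discounted twice)
```

With γ = 0.5 the single terminal reward is multiplied by γ² = 0.25, illustrating how smaller discount factors shrink the contribution of distant rewards.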
The discount factor γ determines the extent to which future rewards are discounted.
The table below illustrates the effect of γ on R_t.

Discount factor    Return                            Proportion of r_t in R_t
γ = 0              R_t = r_t                         1
γ = 1              R_t = r_t + r_{t+1} + … + r_T     1/T

Table 1: Effect of discount factor on return

When the discount factor γ → 0, the return R_t attends relatively more to rewards in
the near future. When the discount factor γ → 1, the return R_t also takes into account
rewards in the more distant future.
To obtain a good policy, an agent must select actions that maximise the return such
that the agent can move into better states towards the goal. A value function is a
goodness measure of a state and there are two forms of value functions being
considered in RL.
Under policy π, a state value function V^π estimates the goodness for an agent to be
in a given state; an action value function Q^π estimates the goodness for an agent to
perform a given action in a given state.
The estimation of value functions is defined with respect to a particular policy. This
forms the foundation of the classic Value Based RL algorithms where the estimation
of value functions affects the learning of a policy.
Formally, under a policy π, the value of a state s, denoted V^π(s), is defined as the
expected return when starting in s and following π thereafter:
V^π(s) = E_π[R_t | s_t = s]. Similarly, the value of performing action a in state s under
policy π, denoted Q^π(s, a), is defined as the expected return starting from s, taking
action a and thereafter following π: Q^π(s, a) = E_π[R_t | s_t = s, a_t = a].
Value functions are estimated from experience. They can be represented in table
format for simple, discrete state spaces, or with parameterized function
approximators for continuous state spaces.
1.4.2 The key to learning
The Bellman equations (Bellman, 1957) are mathematical equations that form the key
to Value Based RL. By expressing the value functions in Bellman equation form, the
value of the states can be iteratively updated and learnt.
The theoretical development of the Bellman equations is made under the model
assumption of a Markov process, which is introduced in the following section, the
Markov property.
1.4.2.1 The Markov property
The Markov Property is a mathematical assumption that the dynamics of a system is
independent of any observations or events beyond the immediate past.
1.4.2.1.1 Markov process
A Markov process is a stochastic process in which the future distribution of a variable
depends only on the current value of the variable. A Markov process that describes
the state transition of an environment is mathematically expressed as follows.
    Pr(s_{t+1} = s' | s_t, s_{t-1}, …, s_0) = Pr(s_{t+1} = s' | s_t)
Equation 1: Markov property
The equation states that the state transition depends only on the current state. This
memory-less property of the state transition is called the Markov property.
1.4.2.1.2 Markov Decision Process
A dynamic system that satisfies the Markov property is called a Markov Decision
Process (MDP).
Definition: In the RL framework, a MDP is composed of (Puterman, 1994): a state
space S; an action space A; a reinforcement function r : S × A → ℝ; a policy
π : S → A; and a state transition function P : S × A → P(S).
It is assumed that the Markov property holds for the state transition and the reward
that they depend only on the current state and the action at each time step.
    Pr(s_{t+1} = s', r_{t+1} = r | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = Pr(s_{t+1} = s', r_{t+1} = r | s_t, a_t)
Equation 2: Markov assumption on the state transition and reward
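To make the MDP components concrete, the sketch below writes them down as plain Python data. The two-state chain is invented for illustration (it is not one of the thesis problems): P[s][a] lists (next state, probability) pairs, and R[(s, a, s')] gives the expected next reward, with unlisted transitions rewarding zero.

```python
S = ["left", "right"]                      # state space
A = ["stay", "move"]                       # action space

# State transition function: P[s][a] = [(s', probability), ...]
P = {
    "left":  {"stay": [("left", 1.0)],
              "move": [("right", 0.9), ("left", 0.1)]},   # stochastic move
    "right": {"stay": [("right", 1.0)],
              "move": [("left", 0.9), ("right", 0.1)]},
}

# Reinforcement function: expected reward for a (s, a, s') transition.
R = {("left", "move", "right"): 1.0}       # all unlisted transitions give 0

# Sanity check: transition probabilities out of each (s, a) must sum to one.
for s in S:
    for a in A:
        assert abs(sum(p for _, p in P[s][a]) - 1.0) < 1e-9
```

Because the Markov assumption holds, this finite table is all an agent with a model needs: the distribution of the next state and reward depends only on the current state and action.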
Given any state s ∈ S and action a ∈ A, the transition probability into the next
possible state s' is denoted by P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a). The expected
value of the next reward is denoted by R^a_{ss'} = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'].
1.4.2.2 The Bellman equations
The Bellman equation provides the basis for the agent to approximate and learn the value
functions for a policy (Bellman, 1957). The one-step dynamics of the environment
give rise to a recursive relationship between the value of the current state and the values of
the next possible states.
$Q^\pi(s, a) = E_\pi\left[ R_t \mid s_t = s, a_t = a \right] = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
Equation 3: Action value function
Similarly, the state value function states that the value of a state must equal the
expected immediate reward plus the discounted value of the next state, summed over
all possible successor states and over all actions that the policy may select to cause
those transitions.
$V^\pi(s) = E_\pi\left[ R_t \mid s_t = s \right] = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right] = \sum_a \pi(s, a)\, Q^\pi(s, a)$
Equation 4: State value function
1.4.2.3 The Bellman optimality equations
The state value function $V^\pi(s)$ represents the value of $s$ under $\pi$. The state value
function defined under an optimal policy $\pi^*$ is termed the optimal state value function,
$V^*(s) = V^{\pi^*}(s)$.
When $V^*(s)$ is achieved, the value functions in Bellman equation form can be
expressed in a special form, known as the Bellman optimality equation.
In Bellman optimality equation form, $V^*(s)$ is defined as the expected return for the
best action from that state.
$V^*(s) = \max_a Q^{\pi^*}(s, a) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$
Equation 5: Optimal state value function
Similarly, the optimal action value function states that the optimal value of a state-action
pair equals the sum of the expected immediate return from that state-action pair and the
discounted return from the best action in the next state.
$\begin{aligned}
Q^*(s, a) &= E\left[ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \right] \\
&= E\left[ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \right] \\
&= \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]
\end{aligned}$
Equation 6: Optimal action value function
The optimal value function $V^*$ is the fixed point of the mapping $V \to f(V)$ where
$f(V)(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s') \right]$. This fixed point $V^*$ is the limit of the sequence
$V, f(V), f^2(V), \ldots, f^n(V), \ldots$
So, the Bellman optimality equation consists of a system of non-linear equations, with
one equation for each state and independent of the policy $\pi$, and the
fixed point $V^*$ can be solved iteratively by repeatedly substituting the current approximation
of $V^\pi$ into the Bellman optimality equation.
1.4.3 An example: A finite grid environment
A simple RL problem under a finite grid environment is used to demonstrate how an
optimal policy can be obtained with the use of a value function.
The environment is represented by a four-by-four grid, with each square a state. The
action set is composed of four directional motor actions: moving up, down, left or
right by one grid square. The reinforcement is -1 on every transition. The terminal
states are located in the upper left and the lower right squares, labelled ‘Goal’. The
agent starts at a random square on the grid and tries to reach either of the two goal states. It
follows a random policy to choose one of the four possible actions.
[A four-by-four grid with the upper-left and lower-right corner squares marked ‘Goal’]
Figure 1: A finite grid problem
The value function learnt under a random policy is shown in the following grid, with
the numbers in the squares indicating the expected values of the states. For instance,
when starting in the lower left corner and following a random policy, the agent takes
22 transitions on average to reach a goal state.
0 -14 -20 -22
-14 -18 -20 -22
-20 -20 -18 -14
-22 -20 -14 0
Figure 2: Value function under random policy of a finite grid problem
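The values in Figure 2 can be reproduced with a short iterative policy evaluation sketch (an illustrative reconstruction, not code from the thesis; it assumes reward -1 per transition, $\gamma = 1$, an equiprobable random policy, and that off-grid moves leave the agent in place):

```python
import numpy as np

# Iterative policy evaluation for the 4x4 grid: reward -1 per transition,
# gamma = 1, equiprobable random policy, terminal squares at two corners.
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(i, j, di, dj):
    ni, nj = i + di, j + dj
    # moving off the grid leaves the state unchanged
    return (ni, nj) if 0 <= ni < N and 0 <= nj < N else (i, j)

V = np.zeros((N, N))
delta = 1.0
while delta > 1e-6:
    delta = 0.0
    for i in range(N):
        for j in range(N):
            if (i, j) in TERMINALS:
                continue
            # Bellman backup under the random policy: average over actions
            new_v = sum(-1.0 + V[step(i, j, di, dj)] for di, dj in MOVES) / 4.0
            delta = max(delta, abs(new_v - V[i, j]))
            V[i, j] = new_v

print(np.round(V))  # the square beside a goal evaluates to about -14
```

The expected number of transitions to a goal under the random policy is the negation of each value, e.g. about 22 from the corner farthest from both goals.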
Since a random policy is suboptimal, the agent uses a new policy. It follows a new
action selection heuristic to move into a neighbouring state with a higher state value.
As a result, the agent always moves to a better state that is closer to the terminal state.
This new policy is optimal with respect to reaching a terminal state quickly. The
value function derived under this optimal policy is shown below, followed by the
optimal policy diagram.
0 -1 -2 -3
-1 -2 -3 -2
-2 -3 -2 -1
-3 -2 -1 0
Figure 3: Value function under optimal policy of a finite grid problem
Figure 4: Optimal policy diagram of a finite grid problem
This example illustrates the concept of classic Value Based methods, which use value
functions to derive an optimal policy. Many algorithms in RL are devised to find the
value functions efficiently.
1.4.4 Relationship between the optimal value functions and the optimal policy
Value functions can be used to define a partial ordering over policies. This
partial ordering of policies is explained as follows.
Definition: A policy $\pi$ is better than or equal to a policy $\pi'$ if and only if
$V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$.
The optimal value function $V^*(s) = \max_\pi V^\pi(s)$ is the unique fixed point of the Bellman
optimality equation. The optimal value function $V^*$ can therefore be attained by a set
of optimal policies, amongst which there exists at least one deterministic optimal policy.
When the optimal value function *V is solved, a deterministic optimal policy *π
chooses only the action at which the maximum is attained in the Bellman optimality
equation for each state. In other words, a deterministic optimal policy *π assigns non-
zero probability only to these actions.
For finite MDPs, the deterministic optimal policy $\pi^*$ is said to be “greedy” with
respect to the optimal state value function. This is because the actions that appear best
after a one-step search are treated as the optimal actions once the optimal state value
function is defined. The term “greedy”, as in “greedy policy”, is used in computer
science to describe any search or decision procedure that selects alternatives based
only on local or immediate considerations. A greedy search algorithm does not involve
any backtracking; it ignores the possibility that a selection may prevent future
access to even better alternatives.
In the case of the optimal action value function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, the one-step
search is not required. The agent can simply find any action that maximizes $Q^*(s, a)$.
This is the benefit of representing the value function over state-action pairs, instead of
states, when the dynamics of the environment are not provided.
1.4.5 The different techniques in solving RL problems
There are two major learning approaches in RL problems. They are the classic Value
Based methods and the modern Policy Search methods. In RL problems, the aim is to
learn a good policy. Policy Search methods and Value Based methods differ in the
mechanism of how a good policy can be learnt (Aberdeen, 2002).
In Value Based methods, the value functions are learnt in order to obtain a good
policy (Sutton & Barto, 1998, Littman & Sun, 2000). In contrast, Policy Search
methods do not require any value functions to update the policy. The policy is learnt
directly through a parameterized function, which is changed according to some
measurements (Aberdeen, Baxter, 2002, Hausen, 1998, Shelton, 2001). An example
of these measurements is the gradient of the return (Baird & Moore, 1999).
1.4.5.1 Value based methods (learning the value of actions)
In value based methods, the value functions are iteratively updated so as to satisfy the
Bellman equation. The policy is output indirectly via the learning of the value
functions.
1.4.5.1.1 Value iteration (learning with a model)
For a state $s$, let $V(s)$ be the approximation of the true but unknown optimal value
function $V^*(s)$. In general, the values $V(s)$ are randomly initialized.
Let $s'$ denote a possible successor state of $s$. Iteratively looping over all $s$, the approximation $V(s)$
is updated towards $V^*(s)$ by satisfying the Bellman equation

$V_{k+1}(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$

where $V_k(s)$ represents the $k$th approximation of $V(s)$ to $V^*(s)$.
Lookup tables
For lookup table value functions, the optimal value function can be solved by
updating each state value according to the Bellman optimality equation,

$V(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s') \right]$

The sweep update of state values reduces the Bellman residual

$\Delta_s = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s') \right] - V(s)$

until the residual is smaller than a positive constant. Upon convergence of the state
value function, an optimal policy can be defined as

$\pi^*(s) = \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$

Value based methods with lookup tables can be severely affected by the size and nature
of the state space, because of the sweep update over all states.
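As an illustrative sketch (not code from the thesis), the same sweep update with the max over actions recovers the optimal values of the grid example from section 1.4.3:

```python
import numpy as np

# Value iteration on the 4x4 grid of section 1.4.3: reward -1 per
# transition, gamma = 1, terminal squares at two opposite corners.
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(i, j, di, dj):
    ni, nj = i + di, j + dj
    return (ni, nj) if 0 <= ni < N and 0 <= nj < N else (i, j)

V = np.zeros((N, N))
delta = 1.0
while delta > 1e-9:
    delta = 0.0
    for i in range(N):
        for j in range(N):
            if (i, j) in TERMINALS:
                continue
            # Bellman optimality backup: V(s) <- max_a [r + V(s')]
            best = max(-1.0 + V[step(i, j, di, dj)] for di, dj in MOVES)
            delta = max(delta, abs(best - V[i, j]))
            V[i, j] = best

print(V[3, 0])  # -3.0: three steps from the far corner to the nearest goal
```

Each state's optimal value is simply the negated shortest-path distance to the nearest goal, matching Figure 3.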
Function approximators by minimizing the Bellman residual
In many real world problems, the state space can be very large or continuous. For
this reason, a continuous model, such as a neural network or some other function
approximator, is needed for the approximation $V(s, w)$ of the optimal value function
$V^*(s)$, where $w$ is the network parameter vector.
Given a learning rate $\alpha$, the parameter $w$ is updated to minimize the Bellman
residual

$\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s', w_t) \right] - V(s, w_t)$

according to the following update equation.

$\Delta w_t = \alpha \left[ \max_a \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma V(s', w_t) \right) - V(s, w_t) \right] \frac{\partial V(s, w_t)}{\partial w_t}$
Equation 7: Policy update equation by minimizing the Bellman residual for value iteration
Using function approximators for value based methods can result in convergence
problems in $\Delta w$. The desired value of the state, $\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V(s', w_t) \right]$, is
expressed as a function of the parameter $w = w_t$ at time $t$. When the update $w \to w'$
is performed, the target value changes as a new function of a different parameter
$w_{t+1} = w'$. As a result, the value of the Bellman residual can increase, and this causes
the value of the parameter $w$ to oscillate or to grow to infinity.
Using function approximators by residual gradient
To overcome the convergence problem of using function approximators with the Bellman
residual, gradient descent is performed on the mean squared Bellman
residual. This method is termed the residual gradient algorithm, with residual term
$r_t + \gamma V(s', w_t) - V(s, w_t)$, and it guarantees convergence of the parameter $w$ to a
local minimum. The network parameter update equation is given as follows.
$\Delta w_t = -\alpha \left[ r_t + \gamma V(s', w_t) - V(s, w_t) \right] \left[ \gamma \frac{\partial V(s', w_t)}{\partial w_t} - \frac{\partial V(s, w_t)}{\partial w_t} \right]$
Equation 8: Policy update equation by minimizing the residual gradient for value iteration
1.4.5.1.2 Q-Learning (learning without a model)
Q-learning (Watkins, 1992) was one of the most important breakthroughs in RL
because Q-learning extends the traditional DP value iteration methods for RL.
The value iteration method requires finding the action that returns the
maximum expected value. This involves computing the sum of the reinforcement and
the integral over all possible successor states for the given action, which is
computationally expensive in practice.
Q-learning solves this issue by simply taking the maximum over the value set of
the successor states. Rather than learning the state value for each state as in value
iteration, Q-learning learns the Q-value (or action value) for each state-action pair.
Therefore, there is a Q-value associated with each action in each state, and the
function representing the Q-values is called a Q-function (or action value function).
During a Q-value update, Q-learning does not require the expected values of the
successor states to be calculated. The value of a state is defined to be the maximum
Q-value in the given state.
Lookup table
When the Q-function is represented in lookup table form, the Q-values are updated
according to

$Q(s, a) \leftarrow r_t + \gamma \max_{a'} Q(s', a')$

The updates of the Q-values minimize the Bellman residual

$\Delta = r_t + \gamma \max_{a'} Q(s', a') - Q(s, a)$
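A minimal lookup-table sketch of this update (illustrative only; the five-state corridor, its rewards and the episode count are assumptions, not a task from the thesis):

```python
import random

# Hypothetical 5-state corridor: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 ends the episode with reward +1; all other steps give 0.
N_STATES, GOAL, GAMMA = 5, 4, 0.9

def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(200):                       # episodes, purely random exploration
    s, done = 0, False
    while not done:
        a = random.randrange(2)
        s2, r, done = env_step(s, a)
        Q[s][a] = r + GAMMA * max(Q[s2])   # lookup-table Q-learning backup
        s = s2

greedy = [max(range(2), key=Q[s].__getitem__) for s in range(GOAL)]
print(greedy)  # [1, 1, 1, 1]: the greedy policy moves right in every state
```

Note that the greedy policy is read off the table afterwards; no model of the transition function was needed during learning.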
Using function approximators
When the state and action product space is large, a neural network can be used instead
to represent the Q-function. The following equations are the parameter update equations
for minimizing the Bellman residual and for the residual gradient.
Bellman residual:
$\Delta w_t = \alpha \left[ r_t + \gamma \max_{a'} Q(s', a', w_t) - Q(s, a, w_t) \right] \frac{\partial Q(s, a, w_t)}{\partial w_t}$
Equation 9: Policy update equation by minimizing the Bellman residual for Q-learning
Residual gradient:
$\Delta w_t = -\alpha \left[ r_t + \gamma \max_{a'} Q(s', a', w_t) - Q(s, a, w_t) \right] \left[ \gamma \frac{\partial \max_{a'} Q(s', a', w_t)}{\partial w_t} - \frac{\partial Q(s, a, w_t)}{\partial w_t} \right]$
Equation 10: Policy update equation by minimizing the residual gradient for Q-learning
The properties of these function approximators are analogous to those in the value
iteration methods. The form that minimizes the Bellman residual can have
convergence problems, whilst the residual gradient form guarantees convergence to a
stable Q-function.
1.4.5.2 Policy search methods (learning without a model and without
the value functions)
Under the constraints of partial observability, learning with Value Based methods is very
complicated (Murphy, 2000, Lanzi, 2000, Lin, Mitchell, 1992). Policy Search
methods have become more practical than Value Based methods in POMDPs
(Aberdeen, Baxter, 2002, Baird & Moore, 1999, Cassandra, 1999, Murphy, 2000)
because they avoid the difficulty and complexity involved in learning value functions under
partial observability.
The nature of Policy Search methods is model-free. In policy search methods, the
policy space is searched by direct optimization methods. The policy is explicitly
represented by its own function approximator (Cassandra, Kaelbling & Littman, 1994,
Peshkin, 2000). This function approximator does not learn any value function. During
learning, policy improvement is achieved by direct gradient ascent (or descent) on
some error functions or by paired comparisons amongst policies. In other words, it is
easier to learn how to act in Policy Search methods than to learn the value of actions
as in Value Based methods.
Value Based methods can suffer from convergence problems (Astrom, 1965), where
the value function either diverges or oscillates between the possible best and worst
outcomes under greedy exploration. Policy Search methods are guaranteed to converge to a
local optimum. However, the learning process can be slow, and convergence to a global
optimum in the policy search space is not guaranteed.
1.4.5.2.1 Direct gradient methods
In direct gradient approach, the policy is represented by a parameterized function with
a weight vector w for action selection. The weight vector w is adjusted to improve
the performance of the agent. The adjustment is made in the direction of the gradient
of the expected return.
Preparation for deriving the gradient function of the return
Let $h$ represent a training history, such that $h_t$ is composed of the state, action and
reward triplets from the start time 0 up to time $t$.

$h_t = \left\{ (s_0, a_0, r_0), (s_1, a_1, r_1), \ldots, (s_t, a_t, r_t) \right\}$
The likelihood of $h_0$ is the probability of the events happening at the start.

$\Pr(h_0) = \Pr(s_0) \Pr(a_0 \mid s_0) \Pr(r_0 \mid s_0)$

The likelihood of $h_1$ is the probability of the events happening from the start up to
time $t = 1$.

$\Pr(h_1) = \Pr(h_0) \Pr(s_1, a_1, r_1 \mid h_0) = \Pr(s_0) \Pr(a_0 \mid s_0) \Pr(r_0 \mid s_0) \cdot \Pr(s_1 \mid s_0, a_0) \Pr(a_1 \mid s_1) \Pr(r_1 \mid s_1)$
Similarly, the likelihood of $h_2$ is the probability of the events happening from the start
up to time $t = 2$.

$\Pr(h_2) = \Pr(h_1) \Pr(s_2, a_2, r_2 \mid h_1) = \Pr(s_0) \Pr(a_0 \mid s_0) \Pr(r_0 \mid s_0) \cdot \Pr(s_1 \mid s_0, a_0) \Pr(a_1 \mid s_1) \Pr(r_1 \mid s_1) \cdot \Pr(s_2 \mid s_1, a_1) \Pr(a_2 \mid s_2) \Pr(r_2 \mid s_2)$
Hence, the likelihood of a particular $h_t$ is the probability of the events happening
from the start up to time $t$.

$\Pr(h_t) = \Pr(h_{t-1}) \Pr(s_t, a_t, r_t \mid h_{t-1}) = \Pr(s_0) \Pr(a_0 \mid s_0) \Pr(r_0 \mid s_0) \prod_{k=0}^{t-1} \Pr(s_{k+1} \mid s_k, a_k) \Pr(a_{k+1} \mid s_{k+1}) \Pr(r_{k+1} \mid s_{k+1})$
Since the action selection function $\Pr(a_t \mid s_t)$ is a non-zero smooth function of the
weight vector $w$, let $\Pr(a_t \mid s_t) = f_t(w)$ to simplify the expression for $\Pr(h_t)$.

$\Pr(h_t) = \Pr(s_0) f_0(w) \Pr(r_0 \mid s_0) \prod_{k=0}^{t-1} \Pr(s_{k+1} \mid s_k, a_k) f_{k+1}(w) \Pr(r_{k+1} \mid s_{k+1})$
Note that the terms $\Pr(r_0 \mid s_0), \ldots, \Pr(r_t \mid s_t)$, $\Pr(s_1 \mid s_0, a_0), \ldots, \Pr(s_t \mid s_{t-1}, a_{t-1})$ and
$\Pr(s_0)$ are not functions of $w$. The state transition function and the reinforcement
function are part of the environment and are invariant to the policy parameter.
The likelihood expression $\Pr(h_t)$ is then differentiated with respect to $w$ to provide
the gradient of the likelihood, $\partial \Pr(h_t) / \partial w$.
$\begin{aligned}
\frac{\partial}{\partial w} \Pr(h_t) &= \frac{\partial}{\partial w} \left[ c \prod_{k=0}^{t} f_k(w) \right], \quad \text{where } c = \Pr(s_0) \Pr(r_0 \mid s_0) \prod_{k=0}^{t-1} \Pr(s_{k+1} \mid s_k, a_k) \Pr(r_{k+1} \mid s_{k+1}) \\
&= c \sum_{n=0}^{t} \left[ \frac{\partial f_n(w)}{\partial w} \prod_{k=0,\, k \ne n}^{t} f_k(w) \right] \\
&= c \left[ \prod_{k=0}^{t} f_k(w) \right] \sum_{n=0}^{t} \frac{\partial f_n(w) / \partial w}{f_n(w)} \\
&= \Pr(h_t) \sum_{k=0}^{t} \frac{\partial}{\partial w} \ln f_k(w) \\
&= \Pr(h_t) \sum_{k=0}^{t} \frac{\partial}{\partial w} \ln \Pr(a_k \mid s_k)
\end{aligned}$

Equation 11: The gradient of $\Pr(h_t)$
This likelihood gradient $\partial \Pr(h_t) / \partial w$ will be used in the calculation of the gradient
function of the return.
The return
Let $\mathrm{H}$ represent a set of training histories $h$. Let $R(h)$ be the return and let

$E[R(h)] \approx \sum_{h \in \mathrm{H}} R(h) \Pr(h)$

be the total expected return obtained from the training histories $h$.
The gradient of the return
Let the terminating time of training history $h$ be $T(h)$.
By differentiating $E[R(h)]$ with respect to $w$ and substituting the result for $\partial \Pr(h) / \partial w$,
the return gradient $\partial E[R(h)] / \partial w$ can be obtained as follows.
$\begin{aligned}
\frac{\partial}{\partial w} E[R(h)] &= \frac{\partial}{\partial w} \sum_{h \in \mathrm{H}} R(h) \Pr(h) \\
&= \sum_{h \in \mathrm{H}} \left[ \frac{\partial R(h)}{\partial w} \Pr(h) + R(h) \frac{\partial \Pr(h)}{\partial w} \right] \\
&= \sum_{h \in \mathrm{H}} \left[ \frac{\partial R(h)}{\partial w} \Pr(h) + R(h) \Pr(h) \sum_{k=0}^{T(h)} \frac{\partial}{\partial w} \ln \Pr(a_k \mid h) \right] \\
&= \sum_{h \in \mathrm{H}} \left[ R(h) \Pr(h) \sum_{k=0}^{T(h)} \frac{\partial}{\partial w} \ln \Pr(a_k \mid h) \right]
\end{aligned}$
The term ( )hRw∂∂ vanishes because the return is part of the environment. The policy
parameter is adjusted to change the probability of some training history happening but
does not alter the return obtained over that training history.
For example, the following figure shows a training history set of four histories,
$\{h_1, h_2, h_3, h_4\}$, with their respective observed returns. In Direct Gradient
methods, the policy parameter $w$ is adjusted to increase the probability of $h_3$
happening and to reduce that of $h_4$. However, this change in policy does not affect
the return observed over the respective training histories.
Figure 5: Example of return realization in training history
Under the Markov assumption, $\Pr(a_t \mid h_t) = \Pr(a_t \mid s_t) = f_t(w)$ represents the policy.
The gradient of the return can be re-expressed as follows.

$\frac{\partial}{\partial w} E[R(h)] = \sum_{h \in \mathrm{H}} \left[ R(h) \Pr(h) \sum_{k=0}^{T(h)} \frac{\partial}{\partial w} \ln f_k(w) \right]$

The return gradient vector

$\nabla E[R(h), w] = \left[ \frac{\partial}{\partial w_1} E[R(h)], \frac{\partial}{\partial w_2} E[R(h)], \ldots \right]$

can now be obtained.
Perform gradient ascent on the return and update the policy vector
The return gradient vector $\nabla E[R(h), w]$ is the gradient of the return with respect to the
weight vector of the current policy. The policy is represented by a point $w$ in the policy
space. It is possible to move the value of $w$ a small distance $\delta$ in the direction

$d = \frac{\nabla E[R(h), w]}{\left\| \nabla E[R(h), w] \right\|}$

which is the direction in which the return $E[R(h)]$ increases most
rapidly. The general policy update equation for direct gradient methods is given by
$w \leftarrow w + \delta \cdot d$.
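A minimal sketch of this update (illustrative only: the two-armed bandit, its payoffs and the softmax action selection function are assumptions, not a task from the thesis):

```python
import math, random

# Hypothetical two-armed bandit: arm 0 pays 1.0, arm 1 pays 0.2.
# Softmax policy Pr(a) = exp(w[a]) / sum_b exp(w[b]);
# the gradient of ln Pr(a) w.r.t. w[i] is (1 if i == a else 0) - Pr(i).
random.seed(1)
w = [0.0, 0.0]
ALPHA, REWARDS = 0.1, [1.0, 0.2]

def policy(w):
    z = [math.exp(v) for v in w]
    return [v / sum(z) for v in z]

for _ in range(2000):
    pi = policy(w)
    a = 0 if random.random() < pi[0] else 1      # sample an action
    R = REWARDS[a]                               # observe the return
    for i in range(len(w)):                      # ascend R * grad ln Pr(a)
        w[i] += ALPHA * R * ((1.0 if i == a else 0.0) - pi[i])

print(policy(w)[0])  # close to 1: the better arm dominates
```

The weight vector is nudged along the gradient of the log-likelihood of each sampled action, scaled by the observed return, so histories with high return become more probable.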
1.4.5.2.2 Policy comparison methods
The policy comparison approach is characterized by using comparative information on
policies to find the optimal policy. Given that a policy is represented by a
parameterized function with a weight vector $w$, the objective is to minimize the
difference in expected return $D(w^*, w) = E[R(w^*)] - E[R(w)]$ between the optimal policy and
the policy under comparison (Strens & Moore, 2002, Ng & Gordon, 2000).
When comparing two policies with parameter vectors $w_1$ and $w_2$ respectively, a fixed
set of initial states $\{s_{0,1}, s_{0,2}, \ldots, s_{0,n}\}$ is used to reduce the stochasticity of the
environment, so that the variability in the return difference $D_{s_0}(w_1, w_2)$ over a particular
initial state is predominantly due to the difference in policy parameters.
Given a large number $n$ of initial states, paired statistical tests can be applied to model the
difference between two policies evaluated from the same starting states.
Possible choices of statistical test include the two-sample (paired) t-test and the Wilcoxon
test.
The paired t-test assumes the $n$ return differences follow a normal
distribution and returns a probability indicating the likelihood of a non-zero
mean for the return differences. The Wilcoxon test uses the ranks of the return
differences to return a probability that indicates the likelihood of a non-zero median.
Once a better policy is determined, the less effective policy is replaced by a new
policy. The parameter $w_{new}$ of the new policy can be generated by some optimization
procedure. Such procedures include the Downhill Simplex Method, which generates
$w_{new}$ in the direction away from the previously less effective policy parameter, and
Differential Evolution, which uses genetic algorithms to improve $w_{new}$ directly.
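The paired comparison can be sketched as follows (illustrative only: the per-start returns are simulated, and the paired t statistic is computed by hand rather than with a statistics library):

```python
import math
import random

# Sketch of a paired policy comparison, assuming two policies are each
# evaluated once from the same fixed set of n starting states.
random.seed(2)
n = 30
returns_1 = [random.gauss(10.0, 1.0) for _ in range(n)]            # policy w1
returns_2 = [r + 0.5 + random.gauss(0.0, 0.3) for r in returns_1]  # policy w2

# Paired differences over the same n starting states.
d = [b - a for a, b in zip(returns_1, returns_2)]
mean_d = sum(d) / n
var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
t_stat = mean_d / math.sqrt(var_d / n)

print(t_stat)  # a large t suggests a genuine difference between the policies
```

Pairing on the same starting states cancels much of the environment's stochasticity, which is exactly why the fixed initial-state set is used.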
1.5 Ancillary issues
The following two sub-sections describe two common RL issues namely the
exploration-exploitation dilemma and the temporal credit assignment problem. The
last sub-section explains a shaping technique, which is used to overcome complex RL
problems.
1.5.1 Exploration–exploitation dilemma
An agent is exploiting when it selects an action to obtain a reward that is already
known, and exploring when it selects another action to gain new information. A
system may lose performance by exploring, while it may never improve its
performance through exploiting alone (Holland, 1992). It is important to
balance the exploration and exploitation ratio so as to maximize the knowledge gained
during learning and to minimize the costs of exploration and learning time. A common
practice used to overcome the exploration-exploitation dilemma is to implement an $\varepsilon$-greedy
policy, in which the agent exploits with probability $(1 - \varepsilon)$ and explores
with probability $\varepsilon$.
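A minimal sketch of $\varepsilon$-greedy action selection over a row of lookup-table Q-values (an illustrative helper; the names and values are assumptions, not thesis code):

```python
import random

# Epsilon-greedy selection: explore with probability epsilon,
# otherwise pick the action with the highest Q-value.
def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

random.seed(3)
q = [0.1, 0.5, 0.2]
picks = [epsilon_greedy(q, 0.1) for _ in range(1000)]
print(picks.count(1))  # the greedy action dominates when epsilon is small
```

With $\varepsilon = 0.1$ the greedy action is chosen about 93% of the time (0.9 plus a 1-in-3 share of the random picks).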
1.5.2 Temporal credit assignment
The rewards provided to the agent can be noisy and delayed from the actions that
caused them. The designer has to deal with the problem of how to reinforce actions
that have long-reaching effects. The return to be maximized can be formulated in either
of the following two ways.

$R(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s \right]$
Equation 12: A discounted sum of rewards
$R(s) = \lim_{T \to \infty} E\left[ \frac{1}{T} \sum_{t=1}^{T} r_t \;\middle|\; s_0 = s \right]$
Equation 13: A long term average reward
The discounted version of the return is more robust because it allows the
temporal credit assignment problem to be solved in infinite horizon settings (Sutton & Barto,
1988).
If the state transition process is ergodic, every state is eventually visited
arbitrarily often. In that case the long term average version can be used, since it gives equal weighting
to all rewards received throughout the evolution of the process.
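The two return formulations can be computed for a finite reward sequence as a quick sketch (an illustrative finite-horizon truncation of the sums above):

```python
# Discounted sum of rewards (Equation 12, truncated to a finite sequence)
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Long term average reward (Equation 13, over a finite horizon)
def average_reward(rewards):
    return sum(rewards) / len(rewards)

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(average_reward(rewards))          # 1.0
```

The discount factor $\gamma < 1$ keeps the infinite sum finite, while the average formulation weights early and late rewards equally.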
1.5.3 Shaping
The size of the state space correlates with the complexity of the given task. Searching
through a large policy space for optimal actions can be time consuming. Shaping is
used to ease the agent’s learning in complex problems. This is achieved by giving the
agent a series of relatively easy problems building up to the harder problem of
interest (Selfridge, Sutton & Barto, 1985).
Consider the example of training a mechanical arm to pick up an object. The machine
has control over the various joints of the arm. Given that the object is within reachable
distance of the arm, the space of joint movements that bring the claw above the object and
then lower it to grasp the object is combinatorially large.
Shaping can be applied to ease learning by decomposing the task into simpler sub-
tasks. The arm can firstly learn to bring the claw in line with the object. Then, it learns
to bring the claw over the object. And finally, the arm can bring the claw down to
hold onto the object.
1.6 RL Applications
RL methods are practical for real applications. In a typical control situation, the
controllers of automated processes must adapt to a dynamically changing environment
where the optimal heuristic is not known. A substantial number of applications have
proven that the automated nature of RL is fruitful.
1.6.1 Cellular communication system
An RL method has been applied to a large cellular telephone system with approximately
$49^{70}$ states (Singh, Jaakkola & Jordon, 1995). The task of bandwidth allocation is
formulated as a dynamic programming problem. The goal is to find a dynamic channel
allocation policy that maximizes service by minimizing the number of blocked
and dropped calls.
In this optimal channel control problem, the large state space makes traditional DP
techniques impractical. This large scale optimal control problem is solved by the DP-induced
RL method. Better solutions are found than with any previously available
heuristics. This demonstrates the superiority of RL paradigms in complex and large
optimal control applications.
1.6.2 Others
Other concrete and successful RL applications are elevator control (Crites & Barto,
1996), job-shop scheduling (Zhang & Dietterich, 1995), game learning such as chess
and backgammon (Sutton & Barto, 1998), power network distribution (Schneider, 1999)
and Internet partner scheduling (Abe & Nakamura, 1999).
1.7 Research challenges in RL
Current research in RL has addressed the scalability of algorithms in large state
spaces, the partial observability of the environment and the limitation of reactive
behaviour upon tasks which require memory.
1.7.1 Scaling up to large problems
1.7.1.1 Complex task
Hierarchical decomposition of a task represents a strategy for dealing with very large
state spaces (Hernandez-Gardiol & Mahadevan, 2000, Dietterich, 2000). The
motivation is to provide faster learning, at the cost of slight sub-optimality in
performance, through the decomposition of a task into a collection of simpler
subtasks. This is especially relevant in situations where the training time is limited.
1.7.1.2 Large and continuous state spaces
Compact value function representation is necessary for value function approximation
in problems with large and continuous state spaces (Cassandra, 1998, Hauskrecht,
2000, Uther & Veloso, 1996). The common practice in function representation is
to use a neural network, which has the disadvantage of being slow to train. Newer
approaches to value function approximation include methods based on fitted
regression trees (Wang & Dietterich, 2002) and support vector
machines (Dietterich & Wang, 2002).
1.7.1.3 Large and continuous action spaces
To learn and act in a large continuous action space, one possible solution is to smooth
the probability transition functions of similar actions during learning (Aberdeen,
Baxter, 2002, Meuleau, Kim, Kaelbling, Cassandra, 1999).
1.7.1.4 Intelligent exploration
In most RL methods, the agent is uninformed, or only minimally informed, about how
to explore the environment. In practice, however, the RL system
designer can provide guidance to the agent in the form of an initial policy (Shelton,
2001, Shapiro, Langley & Shachter, 2000). This information is typically sub-optimal.
The research question is how to exploit the initial policy during learning without
preventing the future development of the optimal policy.
1.7.2 Partial observability of the environment
RL is a common approach to the POMDP training problem (Murphy, 2000,
Schmidhuber, 1991). POMDPs are a generalization of MDPs in which a finite set of
states governs the environment’s dynamics but the agent does not have direct access to
those states. The model originated in the operations research (OR) literature for
describing planning tasks in which complete information about the current situation is
inaccessible. The agent can only infer the state from a set of hopefully state-related
observations (Aberdeen, 2002, Cassandra, 1999, Kaelbling, Littman & Moore,
1996, Williams & Singh, 1998).
The POMDP framework models the partially observable dynamics of the environment
(Figure 6). At each time step $t$, the environment is assumed to be in a state $x_t$. The agent
performs an action $a_t$ to cause a state transition and the environment changes into a
new state $x_{t+1}$. However, the state transition is hidden and the agent receives an
observation $y_t$ as some stochastic function of the state $x_t$. If the observation
exactly represents the state, $y_t = x_t$, the POMDP reduces to the ideal case of a fully observable
MDP. In addition, the agent receives a scalar reward, and the goal of the agent is to
learn a policy $\pi$ that maximizes the return.
Figure 6: State transition dynamic under POMDP
1.7.3 The need of memory
Under partial observability, the current sensory information an agent receives may not
identify the underlying state. This is because the number of distinct observations is
typically far smaller than the number of possible states, so the observations are not
distinctive enough for policy mapping. The true states of the environment are then said
to be hidden.
1.8 Automatic state construction
Automatic state construction is a process in which an agent constructs its own
(internal) state representation. It is a common solution in supervised learning to
problems where the given input does not directly match the required state representation
(Bauer & Pawelzik, 1992, Bauer & Villmann, 1995).
A self-organising neural network with state construction capability (e.g. Growing
Neural Gas) can add refined nodes into its state space to deal with dynamic input
distributions so as to approximate the input space more accurately.
For the reasons given in section 1.7, RL agents need to construct their own state space.
The focus then falls on how these internal state representations can be dynamically
constructed from the observation history, so that useful past experience can be
generalized and applied in new and different situations (Aberdeen & Baxter, 2002,
Dutch, 1998, Lovejoy, 1991, McCallum, 1995).
2 U-Tree: A RL algorithm with automatic state construction functionality
Chapter summary
Chapter 2 presents the U-Tree method (McCallum, 1995), which is a RL algorithm
with automatic state construction functionality. The U-Tree method allows an agent to
construct and refine a tree structured representation of its internal state and policy by
extracting relevant features from the RL system. These features are composed of
present and past observations of the environment or events. Hence, the policy derived
is reactive to both the current and past observations. This incorporates a form of short-term
memory into the agent’s behaviour.
In section 2.1, the U-Tree algorithm is explained in terms of how decision tree
technique can be applied in internal state construction. Details of the Kolmogorov-
Smirnov Two Sample test, which is the selection criterion used for tree expansion and
internal state refinement are discussed.
Section 2.2 provides the pseudo code for the U-Tree algorithm in terms of how
learning can be facilitated by a U-Tree and how a U-Tree grows with experience.
Section 2.3 discusses the limitations and shortcomings of the U-Tree algorithm.
2.1 Introduction to the U-Tree algorithm
U-Tree (McCallum, 1995) is a RL algorithm designed to overcome the
aforementioned practical issues in RL research. This method focuses on situations
where purely reactive policy performs sub-optimally under partial observability.
The U-Tree algorithm allows the agent to extract relevant information to create its
own internal state representation. This internal state space serves as an abstraction of
the environment, which classifies the indistinctive states observed for learning to act
upon (Aberdeen & Baxter, 2002, Bakker, 2001, Dutch, 1998, Meuleau, Peshkin, Kim
& Kaelbling, 1999).
Leaves of a U-Tree partition the internal state space into the set of currently most
refined internal states. During leaf expansion, the internal state represented is further
refined.
42
The U-Tree is essentially the policy of the agent: it is a classification tree over the
current state of the agent, with value function estimates stored at the leaves for action
output.
2.1.1 Problem domain targeted by the U-Tree
The U-Tree algorithm was designed for partially observable environments with large
state space dimension. It is capable of dealing with both discrete and continuous state
spaces.
A large state space dimension results in an abundance of observations. Many of these
observations are not task relevant and are not required for the internal state
construction. Feature extraction prunes away these irrelevances.
Hidden states occur when the current observation alone is insufficient to determine
the state of the environment (McCallum, 1995, Cassandra, 1999). Memory from
previous observations is needed to augment the current perceptual input to reveal the
hidden states.
For example, a driver agent can be overloaded with information about its
surroundings on the road. With respect to the task of driving safely, the agent needs to
be aware of the approaching and upcoming traffic and the traffic signals. Information
that describes an approaching vehicle, such as its colour, is irrelevant and should be
pruned away.
Since it is impossible to attend to both the frontal and rear views simultaneously, a
hidden state problem arises when the agent considers changing lanes. Using only the
current frontal percept, it cannot distinguish between the states that correspond to the
presence and the absence of an approaching vehicle in the lane it wishes to change to.
These two undistinguished states are said to be hidden. In order to change lanes safely,
the agent needs to augment the current percept with the rear view information from
previous perceptions to reveal the hidden state.
2.1.2 The architecture of a U-Tree
Before the details of a U-Tree are presented, the concept of decision trees is
discussed.
2.1.2.1 Decision trees
A decision tree is a classifier. It assigns an input to a class amongst a finite number of
classes. The input of a decision tree can be an object or a situation that is described by
a set of attributes. Each internal node corresponds to an attribute. Edges originating
from a node are labelled with the possible values of the associated attribute. The input
is classified by performing a sequence of node-edge matches from the root to a leaf,
which represents a class label or a decision.
The following diagram illustrates a possible decision tree, which can be used to
determine when to play tennis. The input of the tree is the sky outlook. The edges
originating from the sky outlook node are labelled with the possible values and lead
to class labels, each representing a decision upon which an action is made.
For example, it is decided not to play tennis on rainy days.
Figure 7: A decision tree for playing tennis according to the sky outlook
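The root-to-leaf matching can be sketched with a small nested-dictionary tree. This is an illustrative sketch only: the rainy-day branch follows the text, while the sunny and overcast branches are assumptions added to complete the example.

```python
def classify(tree, example):
    """Perform the root-to-leaf node-edge matches described above."""
    while isinstance(tree, dict):                 # internal node: test an attribute
        tree = tree["branches"][example[tree["attribute"]]]
    return tree                                   # leaf: the class label / decision

# The tennis tree of Figure 7: a single test on the sky outlook.
tennis_tree = {"attribute": "outlook",
               "branches": {"sunny": "play",
                            "overcast": "play",
                            "rain": "do not play"}}
```

For instance, `classify(tennis_tree, {"outlook": "rain"})` follows the rain edge to the leaf that decides not to play.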
2.1.2.2 A U-Tree
A history table records the observations made at each time step. A U-Tree classifies
the history table (the input) into an internal state (the class label).
Table 2 below is a typical history table for a soccer agent before the update at time t.

Time            1           2           …    t−1          t
Ball position   Left        Left        …    Unknown      Unknown
Action          Turn left   Turn left   …    Turn right   ?
Reward          3           8           …    −5.5         ?

Table 2: An example of a history table before update
In a U-Tree, each internal node corresponds to a test on a feature, which consists of
an observation f and its history index lag. The history index allows a form of short
term memory by specifying how far back in the history the observation at the node is
to be tested.
Figure 8: A U-Tree for a soccer agent to get to a ball
In Figure 8, the classification process begins at the root node, where the observed ball
position at time t is examined. The value of the current ball position is found to be
'Unknown', and this links the root node to the internal node that examines the
observed ball position at time t−1. The value of the ball position in the immediate past
is also found to be 'Unknown'. A leaf (an internal state) is reached, with the action
a_t = 'Panoramic vision' being carried out. A reward r_t = 6 is received and the
history table is updated.

Time            1           2           …    t−1          t
Ball position   Left        Left        …    Unknown      Unknown
Action          Turn left   Turn left   …    Turn right   Panoramic vision
Reward          3           8           …    −5.5         6

Table 3: An example of a history table after update
A U-Tree also represents the policy of a RL agent because the action value vector and
the action selection probability vector are stored at each leaf.
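The classification walk through a U-Tree can be sketched as follows. Each internal node tests one (observation, lag) feature against the history table; the tree fragment loosely mirrors the soccer example, though its exact branches are illustrative.

```python
def utree_classify(node, history, t):
    """Descend from the root to a leaf; each internal node tests the
    value of one observation at time t - lag in the history table."""
    while "feature" in node:
        obs, lag = node["feature"]
        node = node["branches"][history[t - lag][obs]]
    return node                      # a leaf: an internal state

# A fragment echoing Figure 8: test the current ball position; if it is
# unknown, test the ball position one step back in the history.
soccer_tree = {
    "feature": ("ball position", 0),
    "branches": {
        "Left": {"action": "Turn left"},
        "Unknown": {
            "feature": ("ball position", 1),
            "branches": {
                "Left": {"action": "Turn left"},
                "Unknown": {"action": "Panoramic vision"},
            },
        },
    },
}

history = [{"ball position": "Left"},
           {"ball position": "Unknown"},
           {"ball position": "Unknown"}]
leaf = utree_classify(soccer_tree, history, t=2)
```

Here the walk matches the worked example: both the current and the previous ball positions are 'Unknown', so the reached leaf holds the panoramic-vision action.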
2.1.3 Feature extraction
Leaf expansion is periodically carried out to discover relevant features to extend the
tree. A pre-selected pool of candidate features cf is available at each leaf. When an
appropriate winning feature cf_win is found at a leaf, that leaf is extended into a
sub-tree (Figure 9).
The internal state represented by the extended leaf is partitioned into a set of more
refined internal states. The associated time-indexed observations, including the
returns, are classified into the new leaves according to the values of the candidate
feature cf_win.
Figure 9: Partition of a state including returns during leaf expansion
2.1.3.1 The selection criterion
The distributional differences found between the return set and its subsets at a leaf
before and after the introduction of a candidate feature cf indicates the disparity of
cf in refining the internal state with respect to return prediction.
Consider a mobile RL agent, which learns obstacle avoidance behaviour with
obstacles that are identical in size but differ in colour. The candidate features are the
‘colour’ and the ‘size’ of the closest obstacle in view.
The size of the obstacle is task relevant but the colour is not since obstacles are to be
avoided regardless of their colour. Therefore, the return distribution difference, by the
feature ‘size’, should be more significant than that by the feature ‘colour’.
The Kolmogorov-Smirnov (K-S) Two Sample test is used to test the significance of
the difference in return distributions.
2.1.3.2 The Kolmogorov-Smirnov Two Sample test
The Kolmogorov-Smirnov (K-S) test is a statistical test that investigates the
difference between two distributions. The test compares two distributions and outputs
the likelihood of distributional difference.
The test is non-parametric. The null hypothesis assumes the equality of the
distributions under comparison. The probability of the null hypothesis is computed in
terms of the maximum distributional difference.
Procedure

Let X_1 and X_2 be two distributions, with samples M_1 and M_2 taken.

1. Establish the null hypothesis H_0 that the distributions of X_1 and X_2 are equal
2. Construct the class column of a cumulative frequency table
   a. Find x_max = sup (M_1 ∪ M_2)
   b. Find x_min = inf (M_1 ∪ M_2)
   c. Partition [x_min, x_max] into m intervals, each with equal interval length
      c = (x_max − x_min) / m
   d. Label the class column x with {x_min + c, x_min + 2c, …, x_max}
3. List the cumulative frequency columns F_X1(X_1 < x) and F_X2(X_2 < x)

   x            F_X1(X_1 < x)    F_X2(X_2 < x)
   x_max        n_X1,m           n_X2,m
   ⋮            ⋮                ⋮
   x_min + c    n_X1,1           n_X2,1

4. Append the relative cumulative frequency columns
   E_X1(X_1 < x) = F_X1(X_1 < x) / |M_1| and E_X2(X_2 < x) = F_X2(X_2 < x) / |M_2|
5. Append the absolute difference column D = |E_X1 − E_X2|
6. Identify the maximum difference d_max = sup (D)
7. Compute the test statistic TS_D = d_max · M_1 · M_2
8. Check the chi-square distribution χ²(M_1, M_2) and determine an acceptance level
   (1 − θ_D) such that if Pr(χ² ≤ TS_D) > θ_D, reject the null hypothesis H_0 and
   conclude that M_1 and M_2 come from different distributions
Figure 10: Probability density function of the χ² distribution
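The core quantity of the test, the largest gap between the two empirical cumulative distributions, can be sketched in a few lines. This simplified version compares the raw samples directly instead of binning them into m classes as in the procedure above.

```python
def ks_two_sample_dmax(sample1, sample2):
    """Largest absolute gap between the empirical CDFs of two samples:
    the d_max of steps 5-6 above."""
    n1, n2 = len(sample1), len(sample2)
    d_max = 0.0
    for x in sorted(set(sample1) | set(sample2)):
        e1 = sum(1 for v in sample1 if v <= x) / n1   # empirical CDF of sample 1
        e2 = sum(1 for v in sample2 if v <= x) / n2   # empirical CDF of sample 2
        d_max = max(d_max, abs(e1 - e2))
    return d_max
```

In the U-Tree setting, two identical return sets give d_max = 0 (no evidence for splitting a leaf), while completely disjoint return sets give d_max = 1.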
2.2 Construction of a U-Tree
In this section, the U-Tree algorithm is described in pseudo code. It starts with an
overview of the algorithm, followed by descriptions of the exploiting and the
exploring phases.
2.2.1 The U-Tree learning algorithm
Initialization
  Policy iteration frequency parameter freq_improve
  Internal state refinement frequency parameter freq_refine
  The maximum number of episodes allowed Ep_max
  The maximum number of time steps allowed t_max in each episode
  The history table H
  Initialize the U-Tree with the root node as the only state s
  For every a ∈ A(s)
    Initialize Q(s, a) to an arbitrary constant
  Initialize policy π to random action selection

Ep_n = 1                       // Ep_n is the episode number
Repeat
  t = 1
  Repeat
    a_t ← Choose-Action (U-Tree, H)
    t ← t + 1
  Until t = t_max or the episode is completed
  Update the return R in H
  If Ep_n mod freq_improve = 0
    π ← Policy-Iteration (U-Tree, H)
  If Ep_n mod freq_refine = 0
    U-Tree ← Internal-State-Refinement (U-Tree, H)
  Ep_n ← Ep_n + 1
Until Ep_n = Ep_max
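The learning loop can be sketched as a plain Python skeleton, with the three U-Tree routines passed in as callables. This shows the control flow only, not the routines themselves, which are described in the following subsections.

```python
def utree_train(choose_action, policy_iteration, refine,
                ep_max=200, t_max=50, freq_improve=2, freq_refine=5):
    """Skeleton of the U-Tree learning loop: act for t_max steps per
    episode, then periodically improve the policy and refine the tree."""
    history = []                                     # the history table H
    for ep in range(1, ep_max + 1):
        for _ in range(t_max):
            history.append(choose_action(history))   # a_t <- Choose-Action
        if ep % freq_improve == 0:
            policy_iteration(history)                # update Q and pi
        if ep % freq_refine == 0:
            refine(history)                          # grow the tree
    return history
```

With freq_improve = 2 and freq_refine = 5, policy iteration runs every second episode and internal state refinement every fifth, matching the periodic improvement scheme above.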
2.2.2 Using the tree
The U-Tree classifies the history table H to output a_t at each time step.

Function Choose-Action (U-Tree, H) returns a_t
  Append observation o_t to H
  a_t ← U-Tree-Classification (U-Tree, H)
  Append a_t, r_t to H
2.2.3 Improving the tree
The improvement phase consists of the policy improvement process and the internal
state refinement process. Both processes are carried out periodically in accordance
with the parameters freq_improve and freq_refine.
2.2.3.1 The Policy Iteration process
The policy iteration process requires the input of H to update the action values
Q(s, a) for each leaf s.

Function Policy-Iteration (U-Tree, H) returns U-Tree
  For each leaf s
    For each a ∈ A(s)
      I(s, a) = {t | s_t = s, a_t = a}
      R(s, a) = Σ_{t ∈ I(s, a)} R_t / |I(s, a)|
      For every s′ ∈ S
        I(s, a, s′) = {t | s_t = s, a_t = a, s_{t+1} = s′}
        Pr(s′ | s, a) = |I(s, a, s′)| / |I(s, a)|
      Q(s, a) = R(s, a) + γ Σ_{s′} Pr(s′ | s, a) · max_{a′} Q(s′, a′)
  π = Action-Selection-Probability-Update (Q)
  Return U-Tree
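As an illustrative sketch (the helper below is hypothetical, not from the thesis), a single sweep of this update can be written over a list of (s, a, R, s′) experience tuples harvested from the history table. One sweep starts from all-zero estimates; repeated sweeps would be needed for the Q-values to converge.

```python
from collections import defaultdict

def policy_iteration_sweep(transitions, gamma=0.7):
    """One sweep of the leaf-level update: empirical R(s, a) and
    Pr(s' | s, a) are estimated from experience, then Q is updated."""
    visits = defaultdict(list)              # (s, a) -> list of (R, s')
    for s, a, R, s_next in transitions:
        visits[(s, a)].append((R, s_next))
    Q = defaultdict(float)                  # starts from all-zero estimates
    for (s, a), outcomes in visits.items():
        n = len(outcomes)
        q = sum(R for R, _ in outcomes) / n               # R(s, a)
        for s_next in {sn for _, sn in outcomes}:
            p = sum(1 for _, sn in outcomes if sn == s_next) / n
            best = max((Q[(s_next, a2)] for (s2, a2) in list(visits)
                        if s2 == s_next), default=0.0)    # max_a' Q(s', a')
            q += gamma * p * best
        Q[(s, a)] = q
    return Q
```

With two visits to the same leaf-action pair yielding rewards 1 and 3, and a successor leaf with no estimated value yet, the sweep produces the empirical mean reward of 2.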
2.2.3.2 The Internal State Refinement process
An internal state s can be refined by selecting the most suitable candidate feature
cf ∈ CF(s) for leaf expansion. The Kolmogorov-Smirnov Two Sample test is used
to assess the return distributional difference when a candidate feature cf is
introduced. If the probability of the distributional difference exceeds a predetermined
threshold, the candidate feature currently considered extends that leaf.
Function Internal-State-Refinement (U-Tree, H) returns U-Tree
  For each leaf s
    For each cf ∈ CF(s)
      Add a sub-tree at s by cf with potential leaves l
      For each l
        For each a
          I(l, a) ← {t | l_t = l, a_t = a}
          d(s, l, a) ← KS-2-Sample-test ({R_i : i ∈ I(s, a)}, {R_j : j ∈ I(l, a)})
      d(s, cf) ← Σ_l Σ_a d(s, l, a)
    cf_win ← argmax_cf d(s, cf)
    If d(s, cf_win) > predetermined threshold θ_D
      For each l
        Q(l, a) ← Q(s, a)
        l ← s                       // each new leaf l replaces s as an internal state
      s ← s ∪ cf_win
      F(s) ← F(s) − cf_win
  Return U-Tree
2.3 Shortcomings
2.3.1 Circular dependency
The U-Tree algorithm shares the most common weakness of the value based RL
paradigms, in which the model of the environment obtained may bias the optimality of
the policy. During the construction of the U-Tree, the building of the tree depends on
the current Q-value estimates. In turn, the Q-value estimates depend on the current
policy, and the policy depends on the current structure of the tree. This circular
dependency can hinder the convergence of the action value
estimates.
2.3.2 Pre-selection of the candidate feature
The set of pre-selected candidate features must contain all the observations minimally
required to represent the partially observed environment. The designer must have
some background knowledge of the situation to provide a set of candidate features.
3 Proposed method and results
Chapter summary
This chapter presents a variation of the U-Tree algorithm (McCallum, 1996). A new
feature extraction criterion, the Information Gain Ratio (IGR) test, is introduced to the
U-Tree framework for internal state construction.
Section 3.1 introduces the variant of the U-Tree, beginning with a description of the
original U-Tree. This is followed by a section that discusses the technical background
of Information Theory and how it can be applied to give a new RL algorithm.
Section 3.2 provides the pseudo code for the IGR test U-Tree algorithm in terms of
the internal state refinement process.
A set of three experiments is conducted to assess the IGR test with respect to the K-S
test in growing a U-Tree. In section 3.3, the ANOVA test, which is used to compare
performance between the two U-Tree algorithms, is discussed. Then, the three
problem domains are described. These problem domains include robot navigation,
driving, and elevator control. Each problem domain subsection provides the
description of the environment, the task, the action set, the reward, the candidate
feature set, the training conditions, results and comparisons.
3.1 Introduction to the variant of the U-Tree method
A U-Tree is a tree-structured representation of both the internal states and the policy
for an RL agent. The internal states are refined when relevant observations are
extracted to grow the tree.
Our new variant of the U-Tree serves the same purpose. It differs from the original U-
Tree method in the feature extraction criterion, where the IGR test replaces the K-S
test.
3.1.1 Information Gain Ratio test
The IGR test is a feature extraction criterion for decision tree learning (Mitchell, 1997),
corresponding to the C4.5 algorithm (Quinlan, 1993). IGR provides a disparity
measure when a sample is classified by a feature. It measures the expected reduction
in information caused by partitioning the sample according to the feature, normalised
by the homogeneity of that feature.
3.1.2 Procedure of the IGR test
Given a set of returns R, discretized with possible values {u_1, u_2, …, u_m}, and a
feature f with values {v_1, v_2, …, v_n}:

1. Estimate the probability Pr(R = u_j) for j = 1, …, m from the history
2. Calculate the Information (also known as the entropy) I(R), which measures
   the homogeneity of the returns

   I(R) = − Σ_{j=1}^{m} Pr(R = u_j) log₂ Pr(R = u_j)

3. Calculate the Information Gain, which measures the expected reduction in
   information caused by partitioning the return R according to the feature f

   IG(R | f) = I(R) − Σ_{k=1}^{n} Pr(f = v_k) I(R | f = v_k)

4. Calculate the Information of the feature f

   I(f) = − Σ_{j=1}^{n} Pr(f = v_j) log₂ Pr(f = v_j)

5. Calculate the Information Gain Ratio

   IGR(R | f) = IG(R | f) / I(f)
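Steps 1 to 5 can be implemented compactly over paired lists of discretized returns and feature values. This is a minimal sketch, not the thesis code.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain_ratio(returns, feature_values):
    """IGR(R | f) = IG(R | f) / I(f) over paired lists of discretized
    returns and feature values, following steps 1-5 above."""
    n = len(returns)
    gain = entropy(returns)                              # I(R)
    for v, count in Counter(feature_values).items():
        subset = [r for r, fv in zip(returns, feature_values) if fv == v]
        gain -= (count / n) * entropy(subset)            # subtract Pr(f=v) I(R|f=v)
    split_info = entropy(feature_values)                 # I(f)
    return gain / split_info if split_info > 0 else 0.0
```

A feature that perfectly predicts the returns scores 1, while a feature that carries no information about them scores 0.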
3.2 Construction of a U-Tree by IGR test
In the context of a U-Tree, the IGR test aims to select the most appropriate candidate
feature cf to classify the return set {R_j : j ∈ J(s)} at leaf s. The return values are
discretized to allow the calculation of the information on the return.
Function Internal-State-Refinement (U-Tree, H) returns U-Tree
  For each leaf s of the U-Tree
    Retrieve the time index set J(s)
    Retrieve the return set {R_j : j ∈ J(s)} from the history table H
    R(s) ← discretized {R_j : j ∈ J(s)}
    For each candidate feature cf ∈ CF(s) available at leaf s
      For each action a
        Find R(s, a) = {R_t | a = a_t, t ∈ J(s)}
        Calculate the information I[R(s, a)]
        Calculate the information gain IG[R(s, a) | cf]
        Calculate the information I[cf]
        Calculate the information gain ratio IGR[R(s, a) | cf]
      IGR[R(s) | cf] ← Σ_a IGR[R(s, a) | cf]
    cf_win ← argmax_cf IGR[R(s) | cf]
    If IGR[R(s) | cf_win] > predetermined threshold θ
      For each value v of cf_win
        Update the Q-value estimate vector Q(v, a) ← Q(s, a)
        Update the action selection probability vector P(a | v) ← P(a | s)
        Update the candidate features CF(v) ← CF(s) − cf_win
        Update the new leaf s ← v
  Return U-Tree
3.3 Experimental results
We now compare the performance of the K-S test U-Tree and the IGR test U-Tree
in three RL problems: a robot navigation problem, a New York driving problem and
an elevator control problem.
The robot soccer navigation problem involves an agent that learns to position itself to
shoot a goal. The New York driving problem requires an agent to avoid traffic by
changing lanes safely (McCallum, 1995). The elevator control problem aims to
maintain passenger flow in a building with three elevators (Singh, Jaakkola & Jordan,
1995).
To allow automatic internal state construction, a set of candidate features is pre-
selected for feature extraction in each problem. This set of candidate features is a
product set formed from an observation set and a history index set.
The observation set consists of the information the agent observes in the system.
The history index set consists of time lags, which indicate the number of time steps
elapsed before the current one. As a product set of the observation set and the lag set,
the candidate feature set is composed of both current and past observations. For
example, the candidate feature 'observation X at lag 0' denotes the current value of
observation X; the candidate feature 'observation Y at lag 2' indicates the value of Y
observed two time steps ago.
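The product-set construction can be written down directly; the observation names below are a small illustrative subset, not a full problem specification.

```python
from itertools import product

observations = ["ball centre", "ball width", "goal centre"]   # illustrative subset
lags = [0, 1, 2, 3, 4]                                        # the history index set

# Each candidate feature pairs an observation with a lag, e.g.
# ("ball centre", 2) is the ball centre observed two time steps ago.
candidate_features = list(product(observations, lags))
```

With 3 observations and 5 lags, the product set contains 15 candidate features.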
3.3.1 Using the Analysis of Variance statistical test
Analysis of Variance (ANOVA) is a statistical test that can be applied to make
comparisons amongst groups of data. In the context of the U-Tree experiments,
ANOVA is applied to compare the performance variability.
3.3.1.1 One-way ANOVA
In a simple situation, given n groups of independent samples, each with m
observations, a one-way ANOVA test investigates the variability due to the
differences among the groups of samples by comparing the sample means.
Sample \ Observation    1        2        …    m
1                       x_1,1    x_1,2    …    x_1,m
2                       x_2,1    x_2,2    …    x_2,m
⋮                       ⋮        ⋮             ⋮
n                       x_n,1    x_n,2    …    x_n,m

Table 4: Data table for a one-way ANOVA test
15 trials are conducted for each of the U-Tree experiments; the one-way ANOVA
investigates the variability in performance of a U-Tree algorithm across the
trials.
Trial \ Time    1         2         …
1               p_1,1     p_1,2     …
2               p_2,1     p_2,2     …
⋮               ⋮         ⋮
15              p_15,1    p_15,2    …

Table 5: A U-Tree performance data table for a one-way ANOVA test
The null hypothesis of a one-way ANOVA test assumes that all the samples are drawn
from the same population. The test returns a probability to indicate the likelihood of
the null hypothesis. A small returned probability suggests that at least one sample
mean is significantly different from the other sample means.
3.3.1.2 Two-way ANOVA
ANOVA is capable of providing factor analysis to investigate the variability of data
caused by some factors. Given a factor with two levels, A and B, and data collected
over n trials, each with m observations at each factor level, a two-way ANOVA test
investigates the variability caused by the different factor levels.
Trial    Factor \ Observation    1        2        …    m
1        A                       x_1,1    x_1,2    …    x_1,m
         B                       x_1,1    x_1,2    …    x_1,m
2        A                       x_2,1    x_2,2    …    x_2,m
         B                       x_2,1    x_2,2    …    x_2,m
⋮        ⋮                       ⋮        ⋮             ⋮
n        A                       x_n,1    x_n,2    …    x_n,m
         B                       x_n,1    x_n,2    …    x_n,m

Table 6: Data table for a two-way ANOVA test
The two-way ANOVA is used to compare the performances of the two U-Tree
algorithms at the factor levels.
The null hypothesis of a two-way ANOVA test assumes that the different factor levels
have no effects on the samples. The test returns a probability indicating the likelihood
of the null hypothesis. A small probability returned suggests that any significant
variability in the data is contributed by the factor investigated.
3.3.2 A robot soccer problem
The Kiks simulator is used to develop the RL system in the robot soccer problem. The
system embeds an agent that learns to align itself to shoot a goal.
3.3.2.1 Description of a robot soccer problem
The environment consists of the following objects
o a field, which is 1200 mm by 700 mm,
o a single goal, which is 300 mm wide,
o a ball, which is 90 mm in diameter and
o the agent, which is 120 mm in diameter.
The position of the ball and that of the agent are randomly initialised.
Figure 11: A snap shot of the robot navigation problem
The set of actions
The action set of the soccer agent is composed of four actions. They include 3 motor
actions and 1 sensory action. These actions are listed in the following table. Action Description Move forward The agent moves forward by 10 mm Turn left The agent turns left by 9 degrees Turn right The agent turns right by 9 degrees Observe The agent perform a panoramic view to observe the position of the ball
and that of the goal
Table 7: The actions of the soccer agent
The reward
Two reinforcement functions are used, in accordance with the different stages of
learning under the technique of Shaping.
During stage 1 of learning (Figure 11), the agent learns to get within a desirable
distance from the ball. The following reinforcement function calculates the reward in
proportion to the absolute difference between the observed ball width from the
agent's current position and the desired ball width:

r = c_1 · |width_ball − width_desired| + c_2

where c_1, c_2 are scaling constants.
During stage 2 of learning (Figure 11), the agent learns to move into a shooting
position given the policy learnt in stage 1. The reinforcement function is expressed as
a linear combination of the observed ball width and the observed angle between the
centre of the ball and that of the goal:

r = c_3 · width_ball + c_4 · angle_ball-goal + c_5

where c_3, c_4, c_5 are scaling constants.
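The two reinforcement functions can be sketched directly from the formulas above. The scaling constants are placeholders, since their values and signs are not given in the text; in practice the designer would choose them so that progress towards the task yields higher reward.

```python
def stage1_reward(width_ball, width_desired, c1=1.0, c2=0.0):
    """Stage 1: reward proportional to the absolute gap between the
    observed and desired ball width (placeholder constants c1, c2)."""
    return c1 * abs(width_ball - width_desired) + c2

def stage2_reward(width_ball, angle_ball_goal, c3=1.0, c4=1.0, c5=0.0):
    """Stage 2: linear combination of the observed ball width and the
    ball-goal angle (placeholder constants c3, c4, c5)."""
    return c3 * width_ball + c4 * angle_ball_goal + c5
```

Under shaping, the first function drives the agent towards the ball and the second is swapped in once the tree stops growing, to refine the policy towards a shooting position.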
The set of candidate features
The set of candidate features is a product set of an observation set and a history set.
The observation set describes what the agent observes on the soccer field. The history
index set consists of lag ∈ {0, 1, 2, 3, 4}.
Observation            Description                                  Value
Ball centre            The apparent centre of the ball              0 if Unknown
                       relative to the agent                        1 if Left
                                                                    2 if Front
                                                                    3 if Right
                                                                    4 if Rear
Ball width             The apparent width of the ball               0 if w_ball = 0
                       w_ball relative to the agent                 1 if 1 mm ≤ w_ball < 30 mm
                                                                    2 if 31 mm ≤ w_ball < 60 mm
                                                                    3 if 61 mm ≤ w_ball < 90 mm
                                                                    4 if w_ball ≥ 90 mm
Goal centre            The apparent centre of the goal              0 if Unknown
                       relative to the agent                        1 if Left
                                                                    2 if Front
                                                                    3 if Right
                                                                    4 if Rear
Goal width             The apparent width of the goal               0 if w_goal = 0
                       w_goal relative to the agent                 1 if 1 mm ≤ w_goal < 100 mm
                                                                    2 if 101 mm ≤ w_goal < 200 mm
                                                                    3 if 201 mm ≤ w_goal < 300 mm
                                                                    4 if w_goal ≥ 300 mm
Angle from goal        The apparent angle ∠gab from the             n if (n−1)π/6 < ∠gab ≤ nπ/6,
to ball                centre of the goal to that of the            n = 1, 2, …, 12
                       ball, relative to the agent
Previous action        The previous action taken                    Move forward
                                                                    Turn left
                                                                    Turn right
                                                                    Observe
Last observe           An interval indication of the number         1 if 0 ≤ t < 5
                       of time steps t past since the last          2 if 5 ≤ t < 10
                       observe action                               3 if t ≥ 10
Random number          A number randomly generated                  {1, …, 10}

Table 8: The set of observations in the robot soccer problem
Training conditions
Each training session consists of 200 episodes, whilst each episode contains 50
iterations. An episode terminates if the agent achieves the task or if the maximum
number of iterations allowed is reached. The various parameters of the experiment are
listed below. Note that a hand-crafted policy is used to obtain the most efficient set of
parameters for this experimental purpose.
Parameter                    Description                                       Value
Exploration rate             The probability of choosing a random              ε = 1 for episodes 1–10,
                             action in an ε-greedy policy                      0.15 thereafter
Discount rate                Discounts the future rewards in the               γ = 0.7
                             return computation
Learning rate for            Determines the change ratio of the action         β_q = 0.05
action value                 values with respect to new experience
Learning rate for            Determines the change ratio of the action         β_p = 0.1
action preference            selection preferences with respect to new
                             experience
Frequency for action         The regularity of updating the action             freq_q = 2
value update                 values, expressed in number of episodes
Frequency for action         The regularity of updating the action             freq_p = 4
preference update            selection preferences, expressed in number
                             of episodes
Frequency for internal       The regularity of refining the internal           freq_s = 5
state refinement             states, expressed in number of episodes
K-S test critical region     The threshold probability (1 − p) that the        p = 0.1
                             test statistic must exceed to reject H_0

Table 9: The set of parameters used in the robot soccer problem
The technique of Shaping is used in the robot soccer problem. Two stages of learning,
with two reinforcement functions, are involved in solving the problem. The
first stage requires the agent to learn to get close to the ball. The second stage teaches
the agent to position itself to shoot a goal. The transition from stage 1 to stage 2 is
conditioned on the number of episodes passed since the U-Tree previously grew.
If the U-Tree exhibits no further development for a period of 40 episodes, stage 2
learning is initiated with the respective reinforcement function.
At the beginning of each episode, the agent observes its environment as its first
default action. The first 10 episodes are used for experience gathering. The U-Tree
begins its development at the end of the 10th episode and is checked for improvement
every 5 episodes.
3.3.2.2 Results
Both the K-S test and the IGR test versions of the U-Tree algorithm succeeded in
achieving the objective. After training, the RL agent, using the internal state tree
structure obtained, is capable of positioning itself to shoot a goal.
The sets of candidate features used for tree construction by the two algorithms are
similar. Both algorithms select the candidate feature 'current ball position' as the root
of the tree. This feature represents the location of the ball centre relative to the
panoramic view of the agent. It is logical that the agent must first find out where the
ball is at present in order to get close to it.
During tree construction, although the two algorithms choose a similar set of
candidate features, the final tree structures differ in size and order. Neither
algorithm selected the random number feature to build the tree.
The performance comparison is made on the number of training steps required to
obtain an internal state tree structure that enables the learning of an optimal policy.
Under this criterion, the K-S test U-Tree algorithm is found to be slower than the
IGR test U-Tree algorithm by 5 episodes on average.
Results from the K-S test U-Tree algorithm
The following figure of success rate demonstrates the performance of the U-Tree
algorithm, which uses the K-S test for feature extraction in solving the robot
navigation problem.
Figure 12: Robot navigation success rate per episode by K-S test
While experience is being collected for learning, from episode 1 to 10, a random
policy is used for exploration purposes, and the performance of that period is poor.
Once the construction of the U-Tree has started, the performance improves quickly.
At episode 50, the performance degrades upon the start of stage 2 learning. The
degradation can be explained by the agent attempting to accomplish
the objective of stage 2 using the tree structure learnt in stage 1. This internal
state tree structure is then refined according to the reinforcement
function given for stage 2 learning. A good policy, which provides a high success rate
in positioning the agent in a shooting position, is obtained by episode 70.
A one-way ANOVA test indicates that the variability of the results across the fifteen
trials is not significant at the 95% confidence level; the differences among the trials
are consistent with chance variation in
the robot navigation problem.
Results from the IGR test U-Tree algorithm
The figure shown below demonstrates the performance of the U-Tree algorithm with
the use of the IGR test in solving the robot navigation problem.
Figure 13: Robot navigation success rate per episode by IGR test
A random policy is used for exploration purposes from episode 1 to 10, and poor
performance results.
Once the U-Tree is in use, the performance improves quickly. The performance
degradation at episode 50 results from the transition between learning stages with
different objectives. A good policy is obtained at episode 65, providing a high success
rate in positioning the agent at a shooting position.
The variability of the results is not significant at the 95% confidence level, as shown
by the one-way ANOVA test.
Comparison
Both the feature extraction criteria, the K-S test and the IGR test, have equivalent
efficiency in constructing a U-Tree to solve the robot navigation problem. This is
illustrated in the following diagram, which shows the cumulative success rates for the
two tests, during the different phases of the robot navigation problem, over 15 trials.
Figure 14: Cumulative success rates, respective to the different stage of learning, of the K-S and
IGR test over 200 episodes in the robot navigation problem
The observation is made under the following conditions.
o The internal state refinement process of the U-Tree algorithm occurs at a
frequency of every 5 episodes starting from episode 10
o The stage 2 of learning is invoked when a U-Tree has not been refined after 40
episodes
Under these conditions, a two-way ANOVA test is used to investigate the
performance difference between the K-S test and the IGR test in terms of success rate.
The ANOVA test shows that the performance difference is insignificant at the 99%
confidence level; any difference in performance between the K-S test and the IGR
test U-Tree in the robot soccer problem is consistent with chance variation.
To further assess the two criteria, a comparison is made in terms of the duration
required to accomplish a good policy. Experiments are conducted with the internal
state refinement frequency changed as follows.
o Every 3 episodes
o Every 4 episodes
o Every 5 episodes
Since the objective of stage 1 is relatively simple, stage 2 learning is invoked
when an existing U-Tree remains unchanged for a duration of 20 episodes.
The following figures illustrate the performance of the K-S test and the IGR test with
respect to the different internal state refinement frequencies.
Figure 15: Cumulative success rates, respective to the different stage of learning, of the K-S and
IGR test over 200 episodes at the internal state refinement frequency = 5
Figure 16: Cumulative success rates, respective to the different stage of learning, of the K-S and
IGR test over 200 episodes at the internal state refinement frequency = 4
Figure 17: Cumulative success rates, respective to the different stage of learning, of the K-S and
IGR test over 200 episodes at the internal state refinement frequency = 3
Decreasing the internal state refinement interval lessens the experience collected
before each refinement of the internal state space. The figures show a slight resulting
degradation in the performance of the K-S test U-Tree. However, the two-way ANOVA
test confirms that this difference is insignificant at the 99% confidence level.
The following figures show the coefficient of variation of the K-S statistic and
of the IGR for different lengths of experience (window sizes) collected.
The coefficient of variation is the ratio of the sample standard deviation to the sample
mean. It measures the spread of a sample as a proportion of its mean, expressed as a
percentage. A large coefficient of variation indicates high variability in the sample.
The coefficient of variation can therefore be used to illustrate the variability of the K-
S statistic and that of the IGR during feature extraction. The comparison is made in
extracting the first significant feature in the robot navigation problem.
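The definition translates directly into code:

```python
from statistics import mean, stdev

def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    return 100.0 * stdev(sample) / mean(sample)
```

A sample of identical values gives 0%, and larger percentages indicate a statistic that fluctuates more relative to its typical magnitude.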
Figure 18: The coefficient of variation for the K-S statistics during the first feature extraction
process in the robot navigation problem
Figure 19: The coefficient of variation for the IGR during the first feature extraction process in the robot navigation problem
Both tests extract the current ball position correctly in this situation. However, the
K-S test exhibits a larger variance in the calculated statistic than the IGR test,
especially when experience is scarce. In other words, the IGR test U-Tree is able to
learn more stably because the IGR test displays lower variation in return prediction
than the K-S test.
3.3.3 A New York driving problem
In the New York driving problem (McCallum, 1995), a Q-Learning agent learns to
drive. To drive safely, the agent must learn when and how to change lanes to avoid
traffic.
3.3.3.1 Description of a New York driving problem
The environment of the New York driving problem is a one-way road with four lanes
of traffic. The traffic comprises the agent's vehicle and other trucks. The objective of
the agent is to avoid colliding with the trucks while making forward progress.
The agent travels at a speed of 16 meters per second. It has a visual horizon of 66
meters in front of and behind its vehicle, and it is capable of changing lanes.
There are two types of trucks, slow and fast. Slow trucks travel at 12 meters per
second and appear in front of the agent's vehicle; fast trucks travel at 20 meters per
second and appear behind it. None of the trucks can change lanes.
During the driving simulation, time is discretised at half a second per step, and at
each step the traffic positions are updated according to the vehicles' speeds. During a
time step in which the agent's vehicle changes lanes, it both shifts lanes and moves
forward its normal distance. At each time step a new truck appears in a randomly
selected lane, with both truck types equally likely to come into view.
Figure 20: A snapshot of the New York driving problem
The set of actions
The action set of the driving agent consists of 7 actions, 6 sensory actions and 1
motor action, listed in the following table.

Action | Description
Observe forward left | Look forward at the closest truck in the lane to the agent's left
Observe forward | Look forward at the closest truck in the agent's current lane
Observe forward right | Look forward at the closest truck in the lane to the agent's right
Observe backward left | Look backward at the closest truck in the lane to the agent's left
Observe backward | Look backward at the closest truck in the agent's current lane
Observe backward right | Look backward at the closest truck in the lane to the agent's right
Move to observed lane | Move to the lane previously observed (the agent remains in the same lane if it attempts to move right/left from the rightmost/leftmost lane)

Table 10: The actions of the driving agent
The agent uses these seven actions to navigate through the traffic: it tries not to run
into slow trucks in front and not to be caught by fast trucks from behind. When the
agent runs into the rear of a slow truck in its lane, it performs a 'squeeze', scraping
the side of the truck to move forward. When a fast truck catches the agent from
behind, the truck honks its horn until the agent moves out of its way.
The reward
In order to discourage the 'squeeze' and the 'honk' situations, the reinforcement
function delivers one of three possible rewards at each time step:

    r = -10    if squeeze
        -1     if honked
        +0.1   if clear progress
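The per-step reward can be expressed directly in code; a minimal sketch, assuming the three piecewise values above and that a squeeze takes precedence when both conditions hold:

```python
def driving_reward(squeeze, honked):
    """One of three rewards per time step, per the piecewise function above."""
    if squeeze:      # scraping the side of a slow truck: heaviest penalty
        return -10.0
    if honked:       # a fast truck is honking from behind
        return -1.0
    return 0.1       # clear forward progress
```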
The set of candidate features
The observation set describes what the agent observes on the road. The history index
set consists of lag ∈ {0, 1, 2, 3, 4}.

Observation | Description | Value
Agent lane | The lane of the agent | {1, 2, 3, 4}
Closest front | Distance d of the closest truck in front of the agent | 1 if 1 ≤ d < 3; 2 if 3 ≤ d < 5; 3 if 5 ≤ d ≤ 9; 4 if d > 9; 5 if clear of trucks
Closest front left | Distance of the closest truck on the front left of the agent | same bands as Closest front
Closest front right | Distance of the closest truck on the front right of the agent | same bands as Closest front
Closest rear | Distance of the closest truck behind the agent | same bands as Closest front
Closest rear left | Distance of the closest truck on the rear left of the agent | same bands as Closest front
Closest rear right | Distance of the closest truck on the rear right of the agent | same bands as Closest front
Hear horn | Being honked at | Yes, No
Previous action | The previous action taken | Observe forward left; Observe forward; Observe forward right; Observe backward left; Observe backward; Observe backward right; Move to observed lane
Random number 3 | A randomly generated number | {1, 2, 3}
Random number 5 | A randomly generated number | {1, 2, 3, 4, 5}

Table 11: The set of observations in the New York driving problem
Training conditions
Note that a hand-crafted policy was used to find an efficient set of parameters for
this experiment.

Parameter | Description | Value
Exploration rate | Probability of choosing a random action in an ε-greedy policy | ε = 1.0 if t ≤ 2000; 0.4 if 2001 ≤ t ≤ 6000; 0.2 if 6001 ≤ t ≤ 8000; 0.1 if t > 8000
Discount rate | Discounts future rewards in the return computation | γ = 0.75
Learning rate for action value | Determines how strongly the action values change with new experience | β_q = 0.05
Learning rate for action preference | Determines how strongly the action selection preferences change with new experience | β_p = 0.075
Frequency for action value update | Regularity of updating the action values, in time steps | freq_q = 100
Frequency for action preference update | Regularity of updating the action selection preferences, in time steps | freq_p = 200
Frequency for internal state refinement | Regularity of refining the internal states, in time steps | freq_s = 500
K-S test critical region | Threshold probability (1 − p) that the test statistic must exceed to reject H0 | p = 0.01

Table 12: The set of parameters used in the New York driving problem
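The exploration schedule and the ε-greedy selection it parameterises can be sketched as follows; the list-of-Q-values representation and the rng interface are illustrative assumptions, not the thesis implementation:

```python
import random

def exploration_rate(t):
    """Piecewise epsilon schedule from Table 12 (New York driving problem)."""
    if t <= 2000:
        return 1.0      # pure random exploration
    if t <= 6000:
        return 0.4
    if t <= 8000:
        return 0.2
    return 0.1

def epsilon_greedy(q_values, t, rng=random):
    """With probability epsilon pick a random action index, else the greedy one."""
    if rng.random() < exploration_rate(t):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```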
3.3.3.2 Results
In the New York driving problem, the agent explores with a random policy for the
first 2000 time steps of a 20000 time step training session. Under the random policy,
the probability of having an accident is approximately 25%.
The U-Tree construction process starts at the 2000th step, and the internal states are
refined every 500 steps thereafter. At the end of training, the policies obtained from
the U-Trees constructed under both the K-S test and the IGR test have reduced the
number of accidents (scrapes or honks) to 1/6 of those made under a random policy.
Neither policy, however, achieves completely safe driving in this problem.
Both U-Trees extract the 'current closest front' feature first. This is logical, because
scraping a slow truck incurs the heaviest penalty and is a situation that can be
trivially avoided.
Results from the K-S test U-Tree algorithm
The following figure shows the average accident counts over successive periods of
2000 time steps for a K-S test U-Tree over 15 trials. A radical reduction in accidents
occurs between the 4000th and the 7000th step.
Figure 21: Average accident counts per 2000 steps for a K-S test U-Tree
A one-way ANOVA test indicates that the variability of the results over the 15 trials
is not significant at the 95% confidence level.
Results from the IGR test U-Tree algorithm
A similar figure shows the average accident counts for an IGR test U-Tree over 15
trials. Accidents are significantly reduced between the 4000th and the 7000th step.
Figure 22: Average accident counts per 2000 steps for an IGR test U-Tree
A one-way ANOVA test shows that the variability of the results is not significant at
the 95% confidence level.
Comparison
The agent's objective is to avoid two situations: being honked at by trucks from the
rear, and scraping trucks in front. The performance of the K-S test and IGR test
U-Tree algorithms is therefore compared in terms of honk counts and scrape counts.
The figure below compares the honk counts of the two U-Tree algorithms over
successive intervals of 2000 time steps across 15 trials. An ANOVA test confirms
that the honk avoidance performance of the two U-Tree algorithms is equivalent at
the 99% confidence level (P-value = 0.031).
Figure 23: Average honk count of the K-S and IGR tests per 2000 steps
The following figure shows the scrape counts of the two U-Tree algorithms over
successive intervals of 2000 time steps across 15 trials. An ANOVA test concludes
that the difference in scrape avoidance performance between the two U-Tree
algorithms is significant at the 99% confidence level (P-value = 0.0064).
Figure 24: Average scrape count of the K-S and IGR tests per 2000 steps
The figures show that the average scrape count of the IGR test U-Tree drops
markedly from the 4000th step. This count remains lower than that of the K-S test
U-Tree until the 16000th step, when the performance of the two U-Trees becomes
indistinguishable.
This illustrates that the IGR test U-Tree is capable of extracting relevant features at
an earlier stage of training. In other words, the IGR test U-Tree requires less
experience and learns more quickly than the K-S test U-Tree, as the following
figures explain.
The figures below show the K-S statistics and the IGR observed for different lengths
of collected experience while extracting the first significant feature.
Figure 25: The K-S statistics for the first feature extraction process in the New York driving
problem
Figure 26: The IGR for the first feature extraction process in the New York driving problem
The K-S statistics plot (Figure 25) shows clearly that the K-S test is unable to
differentiate and extract a relevant feature at the time step scales in the plot. The IGR
plot (Figure 26), on the other hand, identifies 'the closest truck in front at present' as
informative for growing the U-Tree. This further strengthens the implication that the
IGR test U-Tree learns more quickly than the K-S test U-Tree.
3.3.4 An elevator control problem
In the elevator control problem, a Q-Learning agent controls three elevators to
maintain the flow of passengers in a 10-floor building. The agent aims to deliver
arriving passengers to their destinations quickly.
3.3.4.1 Description of an elevator control problem
The environment is a simulation of a 10-floor building with three elevators. Each
elevator has a maximum capacity of 10 passengers and must stop on a floor to load
and unload passengers.
Figure 27: A snapshot of the elevator control problem
The number of passengers N arriving on each floor at each time step is governed by
a Poisson distribution with probability function

    Pr(N = n) = λ^n e^(−λ) / n!

where λ is the arrival rate parameter.
The maximum number of passengers allowed in the building is 40. Each arriving
passenger's destination is equally likely to be any floor. A passenger enters an
elevator only if the elevator is travelling in the direction of the passenger's
destination, and exits when the elevator stops at that destination.
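Arrivals under the Poisson law above can be simulated by inverse-transform sampling; λ = 2.8 follows Table 16, while the capacity cap is only an assumption about how the 40-passenger building limit might be enforced:

```python
import math
import random

def sample_poisson(lam, rng=random):
    """Inverse-transform sampling of Pr(N = n) = lam**n * exp(-lam) / n!."""
    u = rng.random()
    n = 0
    p = math.exp(-lam)   # Pr(N = 0)
    cdf = p
    while u > cdf:
        n += 1
        p *= lam / n     # recurrence Pr(N = n) = Pr(N = n-1) * lam / n
        cdf += p
        if n > 10_000:   # numerical safety guard against underflow
            break
    return n

def arrivals(current_occupancy, lam=2.8, capacity=40, rng=random):
    """New arrivals this step, truncated so the building never exceeds capacity."""
    return min(sample_poisson(lam, rng), capacity - current_occupancy)
```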
The set of actions
The action set of the central elevator control is the triple product of the action set of
each elevator. The three elevators are identical; the actions available to each elevator
are described in the following table.

Action | Description
Stay | Stay on the current floor to load and unload passengers
Move up | Move up one floor
Move down | Move down one floor

Table 13: The actions of an individual elevator

Formed as the triple product of the individual elevator actions, the central control
action set comprises 27 actions. Two examples are given in the following table.

Central control action | Description
Stay, Stay, Stay | All three elevators stay on their current floors
Move up, Move down, Stay | Elevator 1 moves up one floor; elevator 2 moves down one floor; elevator 3 stays on its floor

Table 14: Examples of central control actions
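The triple-product construction of the central action set can be expressed directly, e.g. (the string labels are those of Table 13):

```python
from itertools import product

ELEVATOR_ACTIONS = ("Stay", "Move up", "Move down")

# The central control action set is the triple (Cartesian) product of the
# per-elevator action set: one component per identical elevator.
CENTRAL_ACTIONS = list(product(ELEVATOR_ACTIONS, repeat=3))
```

With three actions per elevator and three elevators, this yields 3^3 = 27 joint actions.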
The reward
The reinforcement function is proportional to the negative of the total waiting time
incurred by all passengers in the building and the elevators.
With a single passenger in the building, the reinforcement is given as follows:

    r = -t_wait     if t_wait ≤ 40
        -3 t_wait   if t_wait > 40

where t_wait is the waiting time of that passenger.
With more than one passenger, the reinforcement of the system is the sum of the
reinforcements contributed by the individual passengers.
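The per-passenger penalty and its sum over the system can be sketched as follows, assuming the piecewise values above (waiting times in simulation time steps):

```python
def passenger_reinforcement(t_wait):
    """Per-passenger reward: waits beyond 40 are penalised three times as hard."""
    return -t_wait if t_wait <= 40 else -3.0 * t_wait

def system_reinforcement(wait_times):
    """System reward: the sum of the contributions of all passengers present."""
    return sum(passenger_reinforcement(t) for t in wait_times)
```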
The set of candidate features
Only the current observation is used, because memory is not required in this task.

Observation | Description | Value
Motion 1 | Motion of elevator 1 | Up, Down, Stop
Motion 2 | Motion of elevator 2 | Up, Down, Stop
Motion 3 | Motion of elevator 3 | Up, Down, Stop
Occupancy 1 | Occupancy of elevator 1 | Empty, Not full, Full
Occupancy 2 | Occupancy of elevator 2 | Empty, Not full, Full
Occupancy 3 | Occupancy of elevator 3 | Empty, Not full, Full
Passenger out 1 | Presence of an exiting passenger on the current floor of elevator 1 | Yes, No
Passenger out 2 | Presence of an exiting passenger on the current floor of elevator 2 | Yes, No
Passenger out 3 | Presence of an exiting passenger on the current floor of elevator 3 | Yes, No
Wait above 1 | Presence of a passenger waiting on a floor above elevator 1 | Yes, No
Wait above 2 | Presence of a passenger waiting on a floor above elevator 2 | Yes, No
Wait above 3 | Presence of a passenger waiting on a floor above elevator 3 | Yes, No
Wait below 1 | Presence of a passenger waiting on a floor below elevator 1 | Yes, No
Wait below 2 | Presence of a passenger waiting on a floor below elevator 2 | Yes, No
Wait below 3 | Presence of a passenger waiting on a floor below elevator 3 | Yes, No
Long wait above 1 | Presence of a long-waiting passenger on a floor above elevator 1 | Yes, No
Long wait above 2 | Presence of a long-waiting passenger on a floor above elevator 2 | Yes, No
Long wait above 3 | Presence of a long-waiting passenger on a floor above elevator 3 | Yes, No
Long wait below 1 | Presence of a long-waiting passenger on a floor below elevator 1 | Yes, No
Long wait below 2 | Presence of a long-waiting passenger on a floor below elevator 2 | Yes, No
Long wait below 3 | Presence of a long-waiting passenger on a floor below elevator 3 | Yes, No
Long waited portion up 1 | Percentage of long-waiting passengers going up in elevator 1 | 1 if < 50%; 2 if ≥ 50%
Long waited portion up 2 | Percentage of long-waiting passengers going up in elevator 2 | 1 if < 50%; 2 if ≥ 50%
Long waited portion up 3 | Percentage of long-waiting passengers going up in elevator 3 | 1 if < 50%; 2 if ≥ 50%
Random number 3 | A randomly generated number | {1, 2, 3}
Random number 5 | A randomly generated number | {1, 2, 3, 4, 5}

Table 15: The set of observations in the elevator control problem
Training conditions
Note that a hand-crafted policy was used to find an efficient set of parameters for
this experiment.

Parameter | Description | Value
Exploration rate | Probability of choosing a random action in an ε-greedy policy | ε = 1.0 if t ≤ 5000; 0.4 if 5001 ≤ t ≤ 10000; 0.2 if 10001 ≤ t ≤ 15000; 0.1 if t > 15000
Discount rate | Discounts future rewards in the return computation | γ = 0.7
Learning rate for action value | Determines how strongly the action values change with new experience | β_q = 0.05
Learning rate for action preference | Determines how strongly the action selection preferences change with new experience | β_p = 0.05
Frequency for action value update | Regularity of updating the action values, in time steps | freq_q = 150
Frequency for action preference update | Regularity of updating the action selection preferences, in time steps | freq_p = 250
Frequency for internal state refinement | Regularity of refining the internal states, in time steps | freq_s = 500
K-S test critical region | Threshold probability (1 − p) that the test statistic must exceed to reject H0 | p = 0.01
Poisson mean | Passenger arrival rate parameter | λ = 2.8

Table 16: The set of parameters used in the elevator control problem
During the internal state refinement phase, the most relevant feature is extracted to
refine an internal state. Unlike the previous two experiments (the robot navigation
and New York driving problems), in the elevator control problem the refinement
process does not stop after the most relevant feature has been extracted: the new
internal states resulting from the extracted feature are themselves examined for
refinement, until no further partition is possible.
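This recursive split-then-re-examine loop can be sketched as follows; `select_feature` stands in for the K-S or IGR leaf expansion criterion, and its interface (return the winning feature, or None when no split qualifies) is an assumption of this sketch, as is representing instances as dicts of feature values:

```python
def refine(leaf_instances, candidate_features, select_feature):
    """Recursively split a leaf by its most relevant feature until no split qualifies.

    Returns a nested dict: internal nodes carry "feature" and "children",
    leaves carry the remaining "instances".
    """
    feature = select_feature(leaf_instances, candidate_features)
    if feature is None:
        return {"instances": leaf_instances}          # no qualifying split: leaf
    groups = {}
    for inst in leaf_instances:                       # partition by feature value
        groups.setdefault(inst[feature], []).append(inst)
    return {
        "feature": feature,
        "children": {value: refine(group, candidate_features, select_feature)
                     for value, group in groups.items()},
    }
```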
3.3.4.2 Results
In the elevator control problem, the agent explores with a random policy for the first
5000 time steps of a 20000 time step training session. Under the random policy, the
average waiting time of a passenger is approximately 25 seconds.
The U-Tree construction process starts at the 5000th step, and the internal states are
refined every 500 steps thereafter. At the end of training, the policies obtained from
the U-Trees constructed under both the K-S test and the IGR test have reduced the
average waiting time to approximately 5 seconds, an improvement of 80% over the
random policy.
Results from the K-S test U-Tree algorithm
Figure 28 shows the average waiting time per passenger under the K-S test U-Tree in
the elevator control problem. When U-Tree construction starts, the average waiting
time initially drops by approximately 8 seconds, with gradual reduction continuing
until the end of training. Note that the step-wise improvement in performance
reflects the time taken for a relevant feature to be extracted into the tree-structured
policy.
Figure 28: Average waiting time of passengers under the K-S test U-Tree over 15 trials
Results from the IGR test U-Tree algorithm
A similar figure shows the average waiting time per passenger under the IGR test
U-Tree in the elevator control problem. Upon U-Tree construction, the average
waiting time initially drops by approximately 12 seconds, with gradual reduction
following until the end of training. Again, the step-wise improvement in performance
reflects the time taken for a relevant feature to be extracted into the tree-structured
policy.
Figure 29: Average waiting time of passengers under the IGR test U-Tree over 15 trials
Comparison
Both algorithms provide a tree-structured policy that leads to step-wise improvement.
Although a two-way ANOVA test concludes that there is no difference in
performance between the K-S test and IGR test U-Trees at the 99% confidence level,
the IGR test extracts the more relevant features earlier in training. This indicates that
the IGR test U-Tree can learn more rapidly than the K-S test U-Tree.
4 Conclusion
In this thesis, automatic state construction problems have been investigated to
illustrate the practical value of automatic state construction in real RL applications.
A decision tree technique is applied to state construction, resulting in a
tree-structured (internal) state representation and policy.
Automatic state construction allows an RL agent to progressively learn and refine its
own state representation when the raw input does not adequately capture the state of
the environment. During state construction, only task-relevant observations are
extracted, which results in a simpler and more compact state representation than one
defined over all possible combinations of observations. A large state space causes
scalability problems for RL algorithms, and automatic state construction helps to
reduce the occurrence of such problems.
Task-relevant observations include observations made in the past. When a past
observation is extracted for state construction, a form of short-term memory is
introduced into the behaviour of the agent. The policy of the agent is then reactive to
states defined by relevant observations from both the present and the past.
The U-Tree algorithm (McCallum, 1995) is an RL method that uses a decision tree
technique for state construction. Its leaf expansion criterion is the K-S test, which
evaluates the difference in return distributions when a feature is introduced at a leaf.
We presented a variant of the U-Tree that uses the IGR test as the leaf expansion
criterion; the IGR test measures the disparity of the returns as classified by a feature
at a leaf.
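The two leaf expansion criteria can be sketched with stdlib code; this is an illustrative reimplementation on discretised returns, not the thesis code, and the U-Tree's instance bookkeeping is omitted:

```python
import math
from collections import Counter

def ks_statistic(a, b):
    """Two-sample K-S statistic: the largest gap between the empirical CDFs."""
    def cdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    points = sorted(set(a) | set(b))
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain_ratio(returns, feature_values):
    """Information gain of splitting `returns` by a feature, normalised by the
    split entropy (Quinlan's gain ratio)."""
    n = len(returns)
    groups = {}
    for r, f in zip(returns, feature_values):
        groups.setdefault(f, []).append(r)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0
    return (entropy(returns) - remainder) / split_info
```

A feature whose introduction yields a large K-S statistic (or a large gain ratio) separates the return distribution well and is a candidate for expanding the leaf.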
In discrete domains, the state construction functionality of both the K-S test and
IGR test U-Trees is experimentally demonstrated. The major advantage of our IGR
test U-Tree is that it produces a more robust state representation and enables faster
learning. This better performance can be explained by the fact that the IGR exhibits
lower variability in return prediction than the K-S test used in the original U-Tree.
Future research can be oriented toward generalizing the U-Tree approach to
continuous domains.
5 Contributions
In my master's project, the application of a decision tree technique to automatic state
construction is investigated. This involves the implementation of the simulations and
of the tree structure. The U-Tree framework is studied, and the IGR test is applied to
produce a variant of the U-Tree. Three sets of experiments were conducted to
compare the original U-Tree and the new variant. Performance was evaluated and
the results were accepted at the CIMCA 2004 conference.
6 References
Abe, N., & Nakamura, A. (1999). Learning to optimally schedule Internet banner advertisements. ICML 1999, 12-21.
Aberdeen, D. (2002). A survey of approximate methods for solving POMDPs. RSISE, Australian National University.
Aberdeen, D., & Baxter, J. (2002). Internal state policy gradient algorithms for infinite horizon POMDPs (Technical Report). RSISE, Australian National University. http://discus.anu.edu.au/~daa/papers.html.
Astrom, K.J. (1965). Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications, 10.
Baird, L.C., & Moore, A.W. (1999). Gradient descent for general reinforcement learning. Advances in Neural Information Processing Systems 11. MIT Press. http://www.cs.cmu.edu/~eduleemon/paper/index.html.
Bakker, B. (2001). Reinforcement learning with LSTM in non-Markovian tasks with long term dependencies (Technical Report). Leiden University. http://fsw.leidenuniv.nl/www/w3_func/bbaker/abstracts.htm.
Bauer, H.-U., & Pawelzik, K. (1992). Quantifying the neighbourhood preservation of self-organising maps. IEEE Transactions on Neural Networks, 3(4), 570-579.
Bauer, H.-U., & Villmann, Th. (1995). Growing a hypercubical output space in a self-organising feature map. Technical Report TR-95-030, ICSI, Berkeley, July.
Bellman, R. (1957). Dynamic Programming. Princeton, N.J.: Princeton University Press.
Blythe, J. (1999). Decision-theoretic planning. AI Magazine, 1. http://www.isi.edu/~blythe/papers/asmag.html.
Cassandra, A. (1998). Exact and approximate algorithms for partially observable Markov decision processes. Doctoral dissertation, Brown University.
Cassandra, A.R. (1999). POMDPs for dummies: POMDPs and their algorithms, sans formula. http://www.cs.brown.edu/research/ai/pomdp/tutorial/iindex.html.
Cassandra, A.R., Kaelbling, L.P., & Littman, M.L. (1994). Acting optimally in partially observable stochastic domains. Proceedings of the Twelfth National Conference on Artificial Intelligence. Seattle, WA.
Crites, R.H., & Barto, A.G. (1996). Improving elevator performance using reinforcement learning. In D.S. Touretzky, M.C. Mozer, & M.E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1017-1023. MIT Press, Cambridge, MA.
Dietterich, T.G. (2000). An overview of MAXQ hierarchical reinforcement learning. SARA, pp. 26-44.
Dietterich & Wang (2002). Batch value function approximation via support vectors. In T.G. Dietterich, S. Becker, & Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA.
Dutech, A. (1998). Solving POMDPs using selected past events. Proceedings of the 14th European Conference on Artificial Intelligence.
Gullapalli, V. (1992). A comparison of supervised and reinforcement learning methods on a reinforcement learning task. Computer and Information Science Department, University of Massachusetts.
Hansen, E.A. (1998). Solving POMDPs by searching in policy space. Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 211-219. Madison, WI.
Hauskrecht, M. (2000). Value function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13, 33-94.
Hernandez-Gardiol, N., & Mahadevan, S. (2000). Hierarchical memory-based reinforcement learning. Proceedings of Neural Information Processing Systems, 2001.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735-1780.
Holland, J.H. (1992). Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor.
Kaelbling, L.P., Littman, M.L., & Moore, A.W. (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 237-285.
Lanzi, P.L. (2000). Adaptive agents with reinforcement learning and internal memory. Sixth International Conference on the Simulation of Adaptive Behavior (SAB2000).
Lin, L.J., & Mitchell, T.M. (1992). Memory approaches to reinforcement learning in non-Markovian domains (Technical Report CS-92-138). Carnegie Mellon University, Pittsburgh, PA.
Littman, M.L., & Sun, R. (2001). Value-function reinforcement learning in Markov games. Journal of Cognitive Systems Research, 2, 55-66.
Lovejoy, W.S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28, 47-65.
McCallum, A.K. (1995). Learning to use selective attention and short-term memory in sequential tasks. Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, MIT Press, pp. 315-324.
McCallum, A.K. (1996). Reinforcement learning with selective perception and hidden state. Doctoral dissertation, University of Rochester.
Meuleau, N., Peshkin, L., Kim, K.E., & Kaelbling, L.P. (1999). Learning finite state controllers for partially observable environments. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. KTH, Sweden: Morgan Kaufmann.
Meuleau, N., Kim, K.E., Kaelbling, L.P., & Cassandra, A.R. (1999). Learning finite state controllers for partially observable environments.
Mitchell, T.M. (1997). Machine Learning. New York: McGraw-Hill.
Murphy, K.P. (2000). A survey of POMDP solution techniques (Technical Report). Dept. of Computer Science, U.C. Berkeley.
Ng, A.Y., & Jordan, M. (1999). PEGASUS: A policy search method for large MDPs and POMDPs. In C. Boutilier & M. Goldszmidt (eds.), Proceedings of the 16th Conference on Machine Learning, pp. 278-287. Morgan Kaufmann, San Francisco, CA.
Peshkin, L.M. (2000). Thesis proposal: Architecture for policy search.
Puterman, M.L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Schmidhuber, J.H. (1991). Reinforcement learning in Markovian and non-Markovian environments. Advances in Neural Information Processing Systems, pp. 500-506. Morgan Kaufmann Publishers, Inc.
Schneider (1999). Distributed value functions. Proceedings of the 16th International Conference on Machine Learning.
Selfridge, O.J., Sutton, R.S., & Barto, A.G. (1985). Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110, 568-582.
Shelton, C.R. (2001). Policy improvement for POMDPs using normalized importance sampling (Technical Report AI Memo 2001-002). MIT, Cambridge, MA.
Shapiro, D., Langley, P., & Shachter, R. (2000). Using background knowledge to speed reinforcement learning in physical agents. Proceedings of the Fifth International Conference on Autonomous Agents.
Singh, S.P., Jaakkola, T., & Jordan, M.I. (1995). Reinforcement learning with soft state aggregation. Advances in Neural Information Processing Systems, pp. 361-368. The MIT Press.
Strens, M.J.A., & Moore, A.W. (2002). Policy search using paired comparisons. Journal of Machine Learning Research, 3, 921-950.
Sutton, R.S., & Barto, A.G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Uther, W.T.B., & Veloso, M.M. (1998). Tree based discretization for continuous state space reinforcement learning. Proceedings of AAAI-98, Madison, WI.
Watkins, C.J.C.H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.
Wang & Dietterich (2002). Stabilizing value function approximation with the BFBP algorithm. In T.G. Dietterich, S. Becker, & Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA.
Williams, J.K., & Singh, S. (1998). Experimental results on learning stochastic memoryless policies for partially observable Markov decision processes. Advances in Neural Information Processing Systems.
Zhang, W., & Dietterich, T.G. (1996). High-performance job-shop scheduling with a time-delay TD(λ) network. In D.S. Touretzky, M.C. Mozer, & M.E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1024-1030. MIT Press, Cambridge, MA.