Study on Genetic Network Programming (GNP) with Learning and Evolution
Hirasawa Laboratory, Artificial Intelligence Section, Information Architecture Field
Graduate School of Information, Production and Systems
Waseda University
I. Research Background
• Intelligent systems (evolutionary and learning algorithms) can solve problems automatically.
• Systems are becoming large and complex: robot control, elevator group control systems, stock trading systems.
• It is very difficult to make efficient control rules by hand considering the many kinds of real-world phenomena.
II. Objective of the Research
• Propose an algorithm which combines evolution and learning.
– In the natural world:
• Evolution: many individuals (living things) adapt to the world (environment) over long spans of generations; it gives living things their inherent functions and characteristics.
• Learning: the knowledge that living things acquire in their lifetime through trial and error.
III. Evolution
Evolution
• Characteristics of living things are determined by their genes; evolution gives them their inherent characteristics and functions.
• Evolution is realized by the following components:
– Selection: those who fit the environment survive; the others die out.
– Crossover: genes are exchanged between two individuals.
– Mutation: some of the genes of the selected individuals are changed to other ones.
• Through these operations, new individuals are produced.
IV. Learning
Important Factors in Reinforcement Learning
• State transition (definition of states and actions)
• Trial-and-error learning
• Future prediction
Framework of Reinforcement Learning
• Action rules are learned through the interaction between an agent and an environment: the agent sends an action to the environment, and the environment returns a state signal (sensor input) and a reward (an evaluation of the action).
• The aim of RL is to maximize the total reward obtained from the environment: $R = r_t + r_{t+1} + \cdots + r_{t+n}$.
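As a minimal sketch of this interaction loop, assuming a hypothetical one-dimensional environment and a random action policy (`run_episode` and its dynamics are illustrative stand-ins, not part of GNP):

```python
# Sketch of the agent-environment loop: the agent acts, the environment
# returns a state signal and a reward, and the run is scored by total reward.
import random

def run_episode(n_steps=10):
    state = 0                                      # state signal (sensor input)
    total_reward = 0.0
    for t in range(n_steps):
        action = random.choice(["left", "right"])  # the agent takes an action
        state += 1 if action == "right" else -1    # environment: state transition
        reward = 1.0 if state == 3 else 0.0        # reward: evaluation of the action
        total_reward += reward                     # RL maximizes this sum
    return total_reward

print(run_episode())
```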
State Transition
• State transition: $s_t \to s_{t+1} \to s_{t+2} \to \cdots \to s_{t+n}$, where $s_t$ is the state at time $t$ and $a_t$ is the action taken at time $t$ that causes each transition, earning rewards $r_t, r_{t+1}, r_{t+2}, \dots$
• Example (maze problem): starting from the start cell, $a_t$: move right, $a_{t+1}$: move upward, $a_{t+2}$: move left, ..., $a_{t+n}$: do nothing (end). Reaching the goal G yields a reward of 100.
Trial-and-Error Learning
Trial-and-error learning is the central concept of reinforcement learning:
• The agent decides an action and takes it.
• Success (positive reward): take this action again.
• Failure (negative reward): don't take this action again.
The reward (a scalar value) indicates whether the action was good or not, and what the agent accumulates this way is its acquired knowledge (a sketch follows below).
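A small sketch of this loop, assuming a hypothetical two-action environment (`reward_of`) and a simple preference update; the 0.2 exploration rate and 0.1 step size are illustrative:

```python
# Trial-and-error learning: rewarded actions become more likely to be taken
# again; penalized actions less likely.
import random

pref = {"A": 0.0, "B": 0.0}          # acquired knowledge: action preferences
reward_of = {"A": 1.0, "B": -1.0}    # hypothetical rewards per action

for _ in range(100):
    if random.random() < 0.2:                  # keep exploring occasionally
        action = random.choice(list(pref))
    else:                                      # otherwise exploit the knowledge
        action = max(pref, key=pref.get)
    pref[action] += 0.1 * (reward_of[action] - pref[action])

print(pref)  # "A" ends up clearly preferred over "B"
```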
Future Prediction
• Reinforcement learning estimates future rewards and takes actions accordingly: starting from the current state $s_t$, action $a_t$ yields reward $r_t$ and leads to $s_{t+1}$, then $a_{t+1}$ yields $r_{t+1}$ and leads to $s_{t+2}$, and so on into the future.
Future Prediction
• Reinforcement learning considers not only the current reward but also future rewards. Compare the two cases below (see the sketch after them):
– Case 1: $r_t = 1$, $r_{t+1} = 1$, $r_{t+2} = 1$ (total reward 3)
– Case 2: $r_t = 0$, $r_{t+1} = 0$, $r_{t+2} = 100$ (total reward 100)
Case 1 looks better at the current time, but Case 2 is preferable once future rewards are taken into account.
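The two cases can be checked directly; `total_reward` and the optional discount factor `gamma` are illustrative additions (the slide simply sums the rewards):

```python
# Scoring the two cases by total reward: looking only at the immediate reward
# r_t would favor Case 1; summing future rewards favors Case 2.
def total_reward(rewards, gamma=1.0):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

case1 = [1, 1, 1]     # r_t = 1, r_{t+1} = 1, r_{t+2} = 1
case2 = [0, 0, 100]   # r_t = 0, r_{t+1} = 0, r_{t+2} = 100

print(total_reward(case1), total_reward(case2))            # 3 vs 100
print(total_reward(case1, 0.9), total_reward(case2, 0.9))  # 2.71 vs 81.0
```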
V. GNP with Evolution and Learning
Genetic Network Programming (GNP)
GNP is an evolutionary computation method. What is evolutionary computation?
• Solutions (programs) are represented by genes (solution = gene).
• The programs are evolved (changed) by selection, crossover and mutation.
Structure of GNP
• GNP represents its programs using directed graph structures.
• The graph structure is composed of processing nodes and judgment nodes.
• The graph structure can be represented as a gene structure, one row per node:

  0 0 3 4
  0 1 1 6
  0 2 5 7
  1 0 8 0
  1 0 0 4
  1 5 1 2
  … … … …
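Below is a sketch of one plausible reading of this gene table, assuming each row holds (node type, function id, outgoing connections); this column interpretation is an assumption based on the slide, not a confirmed GNP encoding:

```python
# One possible encoding of the directed graph as a gene table.
gene = [
    (0, 0, [3, 4]),   # node 0: judgment node (type 0, assumed), function 0,
    (0, 1, [1, 6]),   #         branches to nodes 3 and 4
    (0, 2, [5, 7]),
    (1, 0, [8, 0]),   # node 3: processing node (type 1, assumed)
    (1, 0, [0, 4]),
    (1, 5, [1, 2]),
]

def next_node(current, branch):
    """Follow one outgoing branch of the graph encoded by `gene`."""
    _node_type, _function_id, connections = gene[current]
    return connections[branch]

print(next_node(0, 1))  # from node 0, branch 1 leads to node 4
```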
Khepera Robot
• The Khepera robot is used for the performance evaluation of GNP.
• Sensors: a sensor value is close to 0 when far from obstacles and close to 1023 when close to obstacles.
• Wheels: the speed of the right wheel V_R and the speed of the left wheel V_L each range from -10 (backward) to 10 (forward).
Node Functions
• Processing node: determines an agent action. Example (Khepera robot behavior): "Set the speed of the right wheel to 10."
• Judgment node: selects a branch based on the judgment result. Example: "Judge the value of sensor 1," with branches "500 or more" and "less than 500."
An Example of Node Transition
• "Judge sensor 1" (branches: the value is 700 or more / less than 700) → "Judge sensor 5" (branches: 80 or more / less than 80) → "Set the speed of the right wheel to 5" (a sketch follows below).
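A sketch of this transition with hypothetical sensor readings; the slide fixes the thresholds (700 for sensor 1, 80 for sensor 5) and the final action, while the remaining branch targets here are assumed for illustration:

```python
# Node transition: judgment nodes pick a branch, a processing node acts.
def run_transition(sensors):
    node = "judge_sensor_1"
    while True:
        if node == "judge_sensor_1":
            # branch on "700 or more" vs "less than 700" (second target assumed)
            node = "judge_sensor_5" if sensors[1] >= 700 else "set_right_wheel_10"
        elif node == "judge_sensor_5":
            # branch on "80 or more" vs "less than 80" (second target assumed)
            node = "set_right_wheel_5" if sensors[5] >= 80 else "set_right_wheel_10"
        elif node == "set_right_wheel_5":
            return 5                     # processing node: set right wheel speed
        elif node == "set_right_wheel_10":
            return 10

print(run_transition({1: 720, 5: 90}))  # sensor 1 >= 700, sensor 5 >= 80 -> 5
```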
Flowchart of GNP
start → Generate an initial population (initial programs) → [one generation: Task execution (reinforcement learning) → Evolution (selection / crossover / mutation)] → repeat until the last generation → stop
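A minimal sketch of this loop; `run_task` and `next_generation` are hypothetical placeholders for GNP's task execution with reinforcement learning and its genetic operators (the operators themselves are sketched further below):

```python
# One generation = execute the task (RL runs during execution), then evolve.
import random

def run_task(program):
    """Placeholder: execute the task while RL adapts the program,
    then return the resulting fitness."""
    return random.random()

def next_generation(population, fitnesses):
    """Placeholder evolution step: keep the better half, then refill the
    population (crossover/mutation would act on the copies)."""
    ranked = [p for _, p in sorted(zip(fitnesses, population), reverse=True)]
    survivors = ranked[: len(ranked) // 2]
    return survivors + [list(p) for p in survivors]

population = [[i] for i in range(10)]        # initial programs (stand-ins)
for generation in range(50):                 # one generation per iteration
    fitnesses = [run_task(p) for p in population]
    population = next_generation(population, fitnesses)
```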
Evolution of GNP: Selection
• Select good individuals (programs) from the GNP population based on their fitness; fitness indicates how well each individual achieves the given task. The selected individuals are used for crossover and mutation.
Evolution of GNP: Crossover and Mutation
• Crossover: some nodes and their connections are exchanged between two individuals (individual 1 and individual 2).
• Mutation: node connections are changed, or a node function is changed (e.g., "speed of right wheel: 5" to "speed of left wheel: 10"). A sketch of both operators follows.
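A sketch of these operators acting on gene tables like the one shown earlier; the exchange rate, mutation rate, and function count are illustrative assumptions:

```python
# Crossover exchanges whole nodes (function plus connections) between two
# individuals; mutation rewires a connection or changes a node function.
import random

def crossover(ind1, ind2, rate=0.2):
    for i in range(min(len(ind1), len(ind2))):
        if random.random() < rate:            # exchange node i between parents
            ind1[i], ind2[i] = ind2[i], ind1[i]

def mutate(ind, rate=0.1, n_functions=8):
    for i, (ntype, func, conns) in enumerate(ind):
        if random.random() < rate:            # change a connection target
            conns[random.randrange(len(conns))] = random.randrange(len(ind))
        if random.random() < rate:            # change the node function
            ind[i] = (ntype, random.randrange(n_functions), conns)

a = [(0, 0, [3, 1]), (1, 2, [0, 2]), (0, 1, [1, 0]), (1, 0, [2, 3])]
b = [(0, 2, [2, 0]), (1, 1, [3, 1]), (0, 0, [0, 3]), (1, 3, [1, 2])]
crossover(a, b)
mutate(a)
print(a, b)
```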
The Role of Learning
Example: after a collision with an obstacle,
• Judgment node: "Judge sensor 0" (branches: 1000 or more / less than 1000). The threshold 1000 is changed to 500 in order to judge obstacles more sensitively.
• Processing node: "Set the speed of the right wheel to 10." The speed 10 is changed to 5 so as not to collide with the obstacle.
Node parameters are changed by reinforcement learning (a hedged sketch follows).
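A hedged sketch of such a parameter update using a Sarsa-style Q-value rule; the slide only states that node parameters are changed by reinforcement learning, so the specific rule, rates, and rewards here are assumptions:

```python
# Each candidate parameter (threshold 1000 vs 500) keeps a Q-value; the rule
# shifts preference toward the value that avoids collisions.
import random

q = {1000: 0.0, 500: 0.0}            # Q-value per candidate threshold
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def choose_param():
    """Epsilon-greedy choice among the candidate parameters."""
    if random.random() < epsilon:
        return random.choice(list(q))
    return max(q, key=q.get)

def update(param, reward, next_param):
    q[param] += alpha * (reward + gamma * q[next_param] - q[param])

for _ in range(200):                 # collisions punish the insensitive threshold
    p = choose_param()
    update(p, reward=(-1.0 if p == 1000 else 1.0), next_param=choose_param())

print(max(q, key=q.get))             # 500 wins: judge obstacles more sensitively
```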
The Aim of Combining Evolution and Learning
• Create efficient programs and search for solutions faster.
• Evolution uses many individuals, and better ones are selected after task execution.
• Learning uses one individual, and better action rules can be determined during task execution.
VI. Simulation
• Wall-following behavior in the simulation environment:
1. All the sensor values must not be more than 1000.
2. At least one sensor value is more than 100.
3. Move straight.
4. Move fast.
$$\text{Reward} = \sum_{t=1}^{1000} C(t)\,\frac{v_R(t)+v_L(t)}{20}\left(1-\frac{|v_R(t)-v_L(t)|}{20}\right), \qquad \text{fitness} = \text{Reward}/1000$$

where $C(t) = 1$ if conditions 1 and 2 are satisfied at time $t$, and $C(t) = 0$ otherwise.
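The reconstructed reward, computed per step; the check of conditions 1 and 2 is stubbed out as a boolean argument:

```python
# Straight, fast motion with the conditions satisfied scores highest.
def step_reward(v_right, v_left, conditions_satisfied):
    c = 1.0 if conditions_satisfied else 0.0
    return c * (v_right + v_left) / 20.0 * (1.0 - abs(v_right - v_left) / 20.0)

print(step_reward(10, 10, True))   # 1.0: straight and fast -> maximum reward
print(step_reward(10, 5, True))    # 0.5625: slower and turning -> less reward
print(step_reward(10, 10, False))  # 0.0: conditions violated -> no reward
```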
Node Functions
• Processing nodes (2 kinds): determine the speed of the right wheel; determine the speed of the left wheel.
• Judgment nodes (8 kinds): judge the value of sensor 0, ..., judge the value of sensor 7.
Simulation Result
• Conditions: number of individuals: 600; number of nodes: 34 (24 judgment nodes, 10 processing nodes).
[Figure: fitness vs. generation for the best individuals, averaged over 30 independent simulations; curves for GNP with learning and evolution and for standard GNP (GNP with evolution).]
[Figure: track of the robot from the start position.]
Simulations in Inexperienced Environments
• Simulation on the generalization ability: the best program obtained in the previous environment is executed in an inexperienced environment. The robot can still show the wall-following behavior.
VII. Conclusion
• An algorithm of GNP using evolution and reinforcement learning was proposed. The simulation results show that the proposed method can learn the wall-following behavior well.
• Future work:
– Apply GNP with evolution and reinforcement learning to real-world applications such as elevator control systems and stock trading models.
– Compare with other evolutionary algorithms.
VIII. Other Simulations
Example of Tileworld
• The tileworld consists of floor cells, tiles, holes, walls, and agents.
• An agent can push a tile and drop it into a hole.
• The aim of the agent is to drop as many tiles into holes as possible.
Fitness = the number of dropped tiles; reward $r_t$ = 1 (when dropping a tile into a hole).
Node Functions
• Processing nodes: go forward, turn right, turn left, stay.
• Judgment nodes: what is in the forward cell (floor, tile, hole, wall or agent)? Likewise for the backward, left, and right cells; the direction of the nearest tile (forward, backward, left, right or nothing); the direction of the nearest hole; the direction of the nearest hole from the nearest tile; the direction of the second nearest tile.
Example of Node Transition
• "What is in the forward cell?" (branches: floor, tile, hole, wall, agent) leads, e.g., to "Direction of the nearest hole" (branches: forward, backward, left, right, nothing), which leads to the processing node "Go forward."
Simulation 1
• Environment I contains 30 tiles and 30 holes.
• The same environment is used every generation.
• Time limit: 150 steps.
Fitness Curve (Simulation 1)
[Figure: fitness (0 to 20) vs. generation (0 to 5000) for GNP with learning and evolution, GNP with evolution, EP (evolution of finite state machines), GP (max depth 5), and GP-ADFs (main tree: max depth 3, ADF: depth 2).]
Simulation 2
• 20 tiles and 20 holes are put at random positions in Environment II (an example of an initial state is shown on the slide).
• One tile and one hole appear just after an agent pushes a tile into a hole.
• Time limit: 300 steps.
Fitness Curve (Simulation 2)
[Figure: fitness (0 to 25) vs. generation (0 to 5000) for GNP with learning and evolution, GNP with evolution, EP, GP (max depth 5), and GP-ADFs (main tree: max depth 3, ADF: depth 2).]
Ratio of Used Nodes
[Figure: two bar charts of the ratio of used nodes per node function (go forward, turn left, turn right, do nothing, judge forward, judge backward, judge left side, judge right side, direction of tile, direction of hole, direction of hole from tile, second nearest tile), at the initial generation and at the last generation.]
Summary of the Simulations
Data on the best individuals obtained at the last generation (30 samples).

Simulation I
                     GNP-LE   GNP-E   GP      GP-ADFs   EP
Mean fitness         21.23    18.00   14.00   15.43     16.30
Standard deviation   2.73     1.88    4.00    1.94      1.99

T-test (p values), Simulation I
          vs GNP-E      vs GP         vs GP-ADFs    vs EP
GNP-LE    1.04×10^-6    3.13×10^-17   3.03×10^-13   5.31×10^-11
GNP-E                   3.17×10^-11   1.32×10^-6    5.95×10^-4

Simulation II
                     GNP-LE   GNP-E   GP      GP-ADFs   EP
Mean fitness         19.93    15.30   6.10    6.67      14.40
Standard deviation   2.43     3.88    1.75    3.19      2.54

T-test (p values), Simulation II
          vs GNP-E      vs GP         vs GP-ADFs    vs EP
GNP-LE    5.90×10^-8    1.53×10^-31   7.46×10^-26   2.90×10^-12
GNP-E                   5.91×10^-15   1.36×10^-13   1.46×10^-1
Summary of the Simulations: Calculation Time Comparison

Simulation I
                                            GNP with LE   GNP with E   GP      GP-ADFs   EP
Calculation time for 5000 generations [s]   1,717         1,019        3,281   3,252     2,802
Ratio to GNP with E (= 1)                   1.68          1            3.22    3.19      2.75

Simulation II
                                            GNP with LE   GNP with E   GP       GP-ADFs   EP
Calculation time for 5000 generations [s]   2,734         1,177        12,059   5,921     1,584
Ratio to GNP with E (= 1)                   2.32          1            10.25    5.03      1.35
The Program Obtained by GNP
• Maze problem: the maze consists of floor cells, walls, a door, an agent, a key (K), and a goal (G).
[Figure: the trajectory of the obtained program, steps 0 to 16.]
• Objective: reach the goal as early as possible. The key is necessary to open the door in front of the goal. Time limit: 300 steps.
• fitness = remaining time (when reaching the goal); 0 (when the agent cannot reach the goal). Reward $r_t$ = 1 (when reaching the goal). A small code rendering follows.
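The fitness definition as a small function (`maze_fitness` is an illustrative name):

```python
# The earlier the goal is reached, the more time remains, the higher the fitness.
def maze_fitness(reached_goal, steps_used, time_limit=300):
    return (time_limit - steps_used) if reached_goal else 0

print(maze_fitness(True, 47))    # goal reached at step 47 -> fitness 253
print(maze_fitness(False, 300))  # goal never reached -> fitness 0
```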
Node Functions
• Processing nodes: go forward, turn right, turn left, random (take one of the three actions randomly).
• Judgment nodes: judge the forward cell, judge the backward cell, judge the left cell, judge the right cell.
Fitness Curve (Maze Problem)
[Figure: fitness (0 to 300) vs. generation (0 to 3000) for GNP with learning and evolution (GNP-LE), GNP with evolution (GNP-E), and GP.]
Data on the best individuals obtained at the last generation (30 samples):

                                        GNP-LE   GNP-E   GP
Mean                                    253.0    246.2   227.0
Standard deviation                      0.00     2.30    37.4
Ratio of reaching the goal              100%     100%    100%
Ratio of obtaining the optimal policy   100%     3.3%    63%