Reinforcement Learning in Simulated Soccer with Kohonen Networks
Chris White and David Brogan
University of Virginia, Department of Computer Science
Simulated Soccer
- How does an agent decide what to do with the ball?
Complexities
- Continuous inputs
- High dimensionality
Reinforcement Learning (RL)
- Learning to associate utility values with state-action pairs
- The agent incrementally updates the value of each state-action pair based on its interaction with the environment (Russell & Norvig); a one-line sketch of this update follows
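A minimal sketch of that incremental update, assuming a tabular utility store and illustrative constants `alpha` (learning rate) and `gamma` (discount); these names are assumptions, not from the original slides:

```python
from collections import defaultdict

Q = defaultdict(float)      # utility estimate for each (state, action) pair
alpha, gamma = 0.1, 0.9     # assumed learning rate and discount factor

def td_update(state, action, reward, next_value):
    """Nudge the stored utility toward the observed one-step return."""
    target = reward + gamma * next_value
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```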
Problems
- The state space explodes exponentially with dimensionality
- Current methods of managing the state-space explosion lack automation
- RL therefore does not scale well to problems with the complexities of simulated soccer
Quantization
- Divide the state space into regions of interest
- Tile coding (Sutton & Barto, 1998), sketched below
  - No automated method for choosing region granularity, heterogeneity, or location
- Prefer a learned abstraction of the state space
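A minimal one-dimensional illustration of tile coding, under assumed grid sizes; the granularity (`n_tiles`) and tiling layout are fixed by hand, which is exactly the automation gap noted above:

```python
def tile_indices(x, n_tilings=4, n_tiles=8, lo=0.0, hi=1.0):
    """1-D tile coding: return the active tile in each offset tiling."""
    width = (hi - lo) / n_tiles
    active = []
    for t in range(n_tilings):
        offset = t * width / n_tilings      # each tiling shifted slightly
        i = int((x - lo + offset) / width)
        active.append((t, min(i, n_tiles - 1)))
    return active
```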
Kohonen Networks
- Clustering algorithm (sketched below)
- Data driven
- Partitions the state space into the cells of a Voronoi diagram
[Figure: Voronoi diagram with example cells: agent near opponent goal; teammate near opponent goal; no nearby opponents]
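A sketch of the winner-take-all clustering step, with sizes taken from the next slide (5000 clusters, 90 inputs); a full Kohonen network also pulls lattice neighbors of the winner toward the input with a shrinking neighborhood, omitted here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
prototypes = rng.standard_normal((5000, 90))    # one prototype per cluster

def kohonen_step(x, lr=0.05):
    """Pull the winning prototype toward input x; return its index."""
    winner = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    prototypes[winner] += lr * (x - prototypes[winner])
    return winner        # index of the Voronoi cell containing x
```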
State Space Reduction
- 90 continuous-valued inputs describe the state of a soccer game
- Naïve discretization: 2^90 states
- Filtering out unnecessary inputs: still 2^18 states
- Clustering algorithm: only 5000 states
- Big win! (the arithmetic below shows the scale)
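The raw magnitudes, as plain arithmetic:

```python
print(2 ** 90)   # 1237940039285380274899124224 states: naive discretization
print(2 ** 18)   # 262144 states: after filtering inputs
print(5000)      # states: after clustering
```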
Two-Pass Algorithm
- Pass 1: use a Kohonen network and a large training set to learn the state space
- Pass 2: use reinforcement learning (SARSA) to learn utilities for the states, as sketched below
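A sketch of how the two passes fit together, reusing `Q` and `kohonen_step` from the earlier sketches; `env` is a hypothetical simulator interface (`reset`/`step`), not part of the original work:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Explore with probability eps, otherwise act greedily."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_pass(env, Q, actions, episodes=1000, alpha=0.1, gamma=0.9):
    # Pass 1 has already fixed the prototypes; lr=0.0 quantizes an
    # observation to its cluster without moving any prototype.
    for _ in range(episodes):
        s = kohonen_step(env.reset(), lr=0.0)
        a = epsilon_greedy(Q, s, actions)
        done = False
        while not done:
            obs, reward, done = env.step(a)
            s2 = kohonen_step(obs, lr=0.0)
            a2 = epsilon_greedy(Q, s2, actions)
            Q[(s, a)] += alpha * (reward + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
```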
Fragility of Learned Actions
- What happens to the attacker's utility if the goalie crosses the dotted line?
Unresolved Issues
- Increased generalization leads to frequency aliasing
- This becomes a sampling problem: few samples vs. many samples
- Example: a Riemann sum, worked below
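A worked version of the Riemann-sum example with an illustrative, rapidly varying integrand: coarse samplings can land entirely in the valleys or entirely on the peaks, and only dense sampling recovers the true integral:

```python
import math

def riemann(f, a, b, n):
    """Midpoint Riemann sum of f over [a, b] with n samples."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda x: math.sin(10 * x) ** 2   # rapidly varying "utility" function
for n in (5, 10, 256):
    print(n, riemann(f, 0.0, math.pi, n))
# n=5 samples land in the valleys (sum 0.0), n=10 on the peaks (sum pi);
# only the dense sampling recovers the true integral pi/2.
```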
Aliasing & Sampling
- The utility function is not band-limited
- How can we sample to reduce error?
  - Uniformly increase the sampling rate? (not the best idea)
  - Adaptively super-sample?
  - Choose sample points based on special criteria?
Forcing Functions
- Use a forcing function to sample an action in a state only when the action is likely to be effective (valleys are ignored)
- Reduces variance in the experienced reward for a state-action pair
- How do we create such a forcing function? (a hypothetical sketch follows)
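The slides leave the construction of forcing functions open; the sketch below shows only the gating idea. `State`, `dist_to_goal`, and `shoot_forcing` are hypothetical illustrations, not the authors' functions:

```python
from dataclasses import dataclass

@dataclass
class State:
    dist_to_goal: float            # hypothetical feature, for illustration

def shoot_forcing(state, action):
    """Only consider shooting when already near the opponent goal."""
    return action != "shoot" or state.dist_to_goal < 20.0

def sample_actions(state, actions, forcing=shoot_forcing):
    allowed = [a for a in actions if forcing(state, a)]
    return allowed or list(actions)   # fall back if everything is gated off
```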
Results
- Evaluate three systems:
  - Control: random action selection
  - SARSA
  - SARSA with forcing functions
- Evaluation criteria: goals scored and time of possession
Cumulative Score
[Chart: SARSA vs. Random Policy; Cumulative Goals Scored vs. Games Played; series: Learning Team, Random Team]
Time of Possession
[Chart: Time of Possession vs. Games Played; series: Time of Possession, Team with Forcing Functions]
SARSA with Forcing Function vs. Random Policy
[Chart: Cumulative Score vs. Games Played; series: Learning Team with Forcing Functions, Random Team Against Team with Forcing Functions]
With Forcing vs. Without
[Chart: Performance With Forcing Functions vs. Performance Without Forcing Functions; Cumulative Score vs. Games Played; series: Learning Team Without Forcing Functions, Random Team Against Team Without Forcing Functions, Learning Team with Forcing Functions, Random Team Against Team with Forcing Functions]
Summary
- Two-pass learning algorithm for simulated soccer
- State-space abstraction is automated
- Data-driven technique
- Improved the state of the art for simulated soccer
Future Work
- Learned distance metric
- Additional automation in the process
- Better generalization