Reinforcement Learning in Simulated Soccer with Kohonen Networks
Chris White and David Brogan
University of Virginia, Department of Computer Science
Simulated Soccer
- How does an agent decide what to do with the ball?
Complexities
- Continuous inputs
- High dimensionality
Reinforcement Learning (RL)
- Learning to associate utility values with state-action pairs
- The agent incrementally updates the value of each state-action pair based on its interaction with the environment (Russell & Norvig); a one-line sketch of this update follows
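A minimal sketch of that incremental update, assuming a tabular utility store and illustrative constants `alpha` (learning rate) and `gamma` (discount); these names are assumptions, not from the original slides:

```python
from collections import defaultdict

Q = defaultdict(float)      # utility estimate for each (state, action) pair
alpha, gamma = 0.1, 0.9     # assumed learning rate and discount factor

def td_update(state, action, reward, next_value):
    """Nudge the stored utility toward the observed one-step return."""
    target = reward + gamma * next_value
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```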
Problems
- The state space explodes exponentially with dimensionality
- Current methods of managing the state-space explosion lack automation
- RL therefore does not scale well to problems with the complexities of simulated soccer
Quantization
- Divide the state space into regions of interest
- Tile coding (Sutton & Barto, 1998), sketched below
  - No automated method for choosing region granularity, heterogeneity, or location
- Prefer a learned abstraction of the state space
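A minimal one-dimensional illustration of tile coding, under assumed grid sizes; the granularity (`n_tiles`) and tiling layout are fixed by hand, which is exactly the automation gap noted above:

```python
def tile_indices(x, n_tilings=4, n_tiles=8, lo=0.0, hi=1.0):
    """1-D tile coding: return the active tile in each offset tiling."""
    width = (hi - lo) / n_tiles
    active = []
    for t in range(n_tilings):
        offset = t * width / n_tilings      # each tiling shifted slightly
        i = int((x - lo + offset) / width)
        active.append((t, min(i, n_tiles - 1)))
    return active
```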
Kohonen Networks
- Clustering algorithm (sketched below)
- Data driven
- Partitions the state space into the cells of a Voronoi diagram
[Figure: Voronoi diagram with example cells: agent near opponent goal; teammate near opponent goal; no nearby opponents]
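A sketch of the winner-take-all clustering step, with sizes taken from the next slide (5000 clusters, 90 inputs); a full Kohonen network also pulls lattice neighbors of the winner toward the input with a shrinking neighborhood, omitted here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
prototypes = rng.standard_normal((5000, 90))    # one prototype per cluster

def kohonen_step(x, lr=0.05):
    """Pull the winning prototype toward input x; return its index."""
    winner = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    prototypes[winner] += lr * (x - prototypes[winner])
    return winner        # index of the Voronoi cell containing x
```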
State Space Reduction
- 90 continuous-valued inputs describe the state of a soccer game
- Naïve discretization: 2^90 states
- Filtering out unnecessary inputs: still 2^18 states
- Clustering algorithm: only 5000 states
- Big win! (the arithmetic below shows the scale)
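The raw magnitudes, as plain arithmetic:

```python
print(2 ** 90)   # 1237940039285380274899124224 states: naive discretization
print(2 ** 18)   # 262144 states: after filtering inputs
print(5000)      # states: after clustering
```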
Two-Pass Algorithm
- Pass 1: use a Kohonen network and a large training set to learn the state space
- Pass 2: use reinforcement learning (SARSA) to learn utilities for the states, as sketched below
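A sketch of how the two passes fit together, reusing `Q` and `kohonen_step` from the earlier sketches; `env` is a hypothetical simulator interface (`reset`/`step`), not part of the original work:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Explore with probability eps, otherwise act greedily."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_pass(env, Q, actions, episodes=1000, alpha=0.1, gamma=0.9):
    # Pass 1 has already fixed the prototypes; lr=0.0 quantizes an
    # observation to its cluster without moving any prototype.
    for _ in range(episodes):
        s = kohonen_step(env.reset(), lr=0.0)
        a = epsilon_greedy(Q, s, actions)
        done = False
        while not done:
            obs, reward, done = env.step(a)
            s2 = kohonen_step(obs, lr=0.0)
            a2 = epsilon_greedy(Q, s2, actions)
            Q[(s, a)] += alpha * (reward + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
```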
Fragility of Learned Actions
- What happens to the attacker's utility if the goalie crosses the dotted line?
Unresolved Issues
- Increased generalization leads to frequency aliasing
- This becomes a sampling problem: few samples vs. many samples
- Example: a Riemann sum, worked below
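A worked version of the Riemann-sum example with an illustrative, rapidly varying integrand: coarse samplings can land entirely in the valleys or entirely on the peaks, and only dense sampling recovers the true integral:

```python
import math

def riemann(f, a, b, n):
    """Midpoint Riemann sum of f over [a, b] with n samples."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda x: math.sin(10 * x) ** 2   # rapidly varying "utility" function
for n in (5, 10, 256):
    print(n, riemann(f, 0.0, math.pi, n))
# n=5 samples land in the valleys (sum 0.0), n=10 on the peaks (sum pi);
# only the dense sampling recovers the true integral pi/2.
```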
Aliasing & Sampling
- The utility function is not band-limited
- How can we sample to reduce error?
  - Uniformly increase the sampling rate? (not the best idea)
  - Adaptively super-sample?
  - Choose sample points based on special criteria?
Forcing Functions
- Use a forcing function to sample an action in a state only when the action is likely to be effective (valleys are ignored)
- Reduces variance in the experienced reward for a state-action pair
- How do we create such a forcing function? (a hypothetical sketch follows)
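The slides leave the construction of forcing functions open; the sketch below shows only the gating idea. `State`, `dist_to_goal`, and `shoot_forcing` are hypothetical illustrations, not the authors' functions:

```python
from dataclasses import dataclass

@dataclass
class State:
    dist_to_goal: float            # hypothetical feature, for illustration

def shoot_forcing(state, action):
    """Only consider shooting when already near the opponent goal."""
    return action != "shoot" or state.dist_to_goal < 20.0

def sample_actions(state, actions, forcing=shoot_forcing):
    allowed = [a for a in actions if forcing(state, a)]
    return allowed or list(actions)   # fall back if everything is gated off
```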
Results
- Evaluate three systems:
  - Control: random action selection
  - SARSA
  - SARSA with forcing functions
- Evaluation criteria: goals scored and time of possession
Cumulative Score
[Chart: SARSA vs. Random Policy; Cumulative Goals Scored vs. Games Played; series: Learning Team, Random Team]
Time of Possession
[Chart: Time of Possession vs. Games Played; series: Time of Possession, Team with Forcing Functions]
SARSA with Forcing Function vs. Random Policy
[Chart: Cumulative Score vs. Games Played; series: Learning Team with Forcing Functions, Random Team Against Team with Forcing Functions]
With Forcing vs. Without
[Chart: Performance With Forcing Functions vs. Performance Without Forcing Functions; Cumulative Score vs. Games Played; series: Learning Team Without Forcing Functions, Random Team Against Team Without Forcing Functions, Learning Team with Forcing Functions, Random Team Against Team with Forcing Functions]
Summary
- Two-pass learning algorithm for simulated soccer
- State-space abstraction is automated
- Data-driven technique
- Improved the state of the art for simulated soccer
Future Work
- Learned distance metric
- Additional automation in the process
- Better generalization