
Selective attention in RL

B. Ravindran
Joint work with Shravan Matthur, Vimal Mathew, Sanjay Karanth, and Andy Barto


Features, features, everywhere!

• We inhabit a world rich in sensory input

• Focus on features relevant to the task at hand

• Two questions:
  1. How do you characterize relevance?
  2. How do you identify relevant features?

Outline

• Characterization of relevance
  – MDP homomorphisms
  – Factored Representations
  – Relativized Options

• Identifying features
  – Option schemas
  – Deictic Options


Markov Decision Processes

• An MDP, M, is the tuple M = ⟨S, A, Ψ, P, R⟩:
  – S: set of states
  – A: set of actions
  – Ψ ⊆ S × A: set of admissible state-action pairs
  – P: Ψ × S → [0, 1]: transition probabilities
  – R: Ψ → ℝ: expected reward
• Policy π: Ψ → [0, 1]
• Maximize total expected reward
  – π*: the optimal policy
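For concreteness, here is a minimal Python sketch (not from the talk) of the tuple ⟨S, A, Ψ, P, R⟩ as a plain data structure; the field names and dictionary layout are illustrative assumptions.

```python
# Minimal sketch (not from the talk): the MDP tuple <S, A, Psi, P, R> as a plain
# Python structure; the dictionary layout is an illustrative assumption.
from dataclasses import dataclass
from typing import Dict, Hashable, Set, Tuple

State = Hashable
Action = Hashable

@dataclass
class MDP:
    states: Set[State]
    actions: Set[Action]
    psi: Set[Tuple[State, Action]]                     # admissible state-action pairs
    P: Dict[Tuple[State, Action], Dict[State, float]]  # P[(s, a)][s'] = transition probability
    R: Dict[Tuple[State, Action], float]               # expected reward for (s, a)

    def admissible_actions(self, s: State) -> Set[Action]:
        """Actions a such that (s, a) is in Psi."""
        return {a for (s2, a) in self.psi if s2 == s}
```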

Notion of Equivalence

(A, E) ≡ (B, N)    (A, W) ≡ (B, S)
(A, N) ≡ (B, E)    (A, S) ≡ (B, W)

[Figure: two-room gridworld with compass directions N, E, S, W]

M = ⟨S, A, Ψ, P, R⟩    M′ = ⟨S′, A′, Ψ′, P′, R′⟩

Find reduced models that preserve some aspects of the original model.

MDP Homomorphism

[Figure: commutative diagram in which h aggregates state-action pairs (s, a) of M, with their transition probabilities P and rewards R, onto pairs (s′, a′) of M′ with P′ and R′]

Given MDPs M = ⟨S, A, Ψ, P, R⟩ and M′ = ⟨S′, A′, Ψ′, P′, R′⟩, an MDP homomorphism is a surjection h: Ψ → Ψ′ defined by h((s, a)) = (f(s), g_s(a)), where f: S → S′ and g_s: A_s → A′_{f(s)}, for all s ∈ S, are surjections such that for all s, s′ ∈ S and a ∈ A_s:

(1) P′(f(s), g_s(a), f(s′)) = Σ_{t ∈ [s′]_f} P(s, a, t)

(2) R′(f(s), g_s(a)) = R(s, a)
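The two conditions translate directly into a check over the admissible pairs. The sketch below assumes the MDP layout from the earlier snippet and represents f as a dict over states and g as a dict of per-state action maps; it is an illustration, not the authors' code.

```python
# Sketch: check the two homomorphism conditions for candidate maps, using the MDP
# layout assumed above; f is a dict over states, g a dict of per-state action maps.
from collections import defaultdict

def is_homomorphism(M, M_img, f, g, tol=1e-9):
    """True if h((s, a)) = (f[s], g[s][a]) satisfies conditions (1) and (2)."""
    for (s, a) in M.psi:
        sa_img = (f[s], g[s][a])
        # Condition (2): rewards agree.
        if abs(M_img.R[sa_img] - M.R[(s, a)]) > tol:
            return False
        # Condition (1): the probability of reaching each block [s']_f agrees.
        block_prob = defaultdict(float)
        for t, p in M.P[(s, a)].items():
            block_prob[f[t]] += p
        for t_img in set(block_prob) | set(M_img.P[sa_img]):
            if abs(block_prob[t_img] - M_img.P[sa_img].get(t_img, 0.0)) > tol:
                return False
    return True
```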

Example

(A, E) ≡ (B, N)    (A, W) ≡ (B, S)
(A, N) ≡ (B, E)    (A, S) ≡ (B, W)

[Figure: the same two-room gridworld with compass directions N, E, S, W]

M = ⟨S, A, Ψ, P, R⟩    M′ = ⟨S′, A′, Ψ′, P′, R′⟩

h((A, E)) = (B, N), h((A, N)) = (B, E), …

State-dependent action recoding

Some Theoretical Results

• Optimal value equivalence [generalizing results of Dean and Givan, 1997]:
  If h(s, a) = (s′, a′), then Q*(s, a) = Q*(s′, a′).
  Corollary: If h(s₁, a₁) = h(s₂, a₂), then Q*(s₁, a₁) = Q*(s₂, a₂).

• Theorem: If M′ is a homomorphic image of M, then a policy optimal in M′ induces an optimal policy in M.

• Solve the homomorphic image and lift the policy to the original MDP.
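Lifting an optimal policy from the image back to M only requires choosing, in each state, any admissible action whose image matches the abstract policy's choice. A hedged sketch, reusing the assumed dict representations of f, g, and a deterministic image policy pi_img:

```python
# Sketch (same assumed representations): lift a deterministic policy pi_img that is
# optimal in the homomorphic image M' back to the original MDP M through f and g.
def lift_policy(M, pi_img, f, g):
    """For each state s, pick any admissible a whose image g[s][a] is pi_img[f[s]]."""
    pi = {}
    for (s, a) in M.psi:
        if s not in pi and g[s][a] == pi_img[f[s]]:
            pi[s] = a   # any pre-image of the optimal abstract action will do
    return pi
```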

More results

• Polynomial-time algorithm to find reduced images (Dean and Givan '97; Lee and Yannakakis '92; Ravindran '04)
• Approximate homomorphisms (Ravindran and Barto '04)
  – Bounds on the loss in the optimal value function
• Soft homomorphisms (Sorg and Singh '09)
  – Fuzzy notions of equivalence between two MDPs
  – Efficient algorithm for finding them
• Transfer learning (Soni, Singh, et al. '06), partial observability (Wolfe '10), etc.

Still more results

• Symmetries are special cases of homomorphisms (Matthur and Ravindran '08)
  – Finding symmetries is GI-complete
  – Harder than finding general reductions
• Efficient algorithms for constructing the reduced image (Matthur and Ravindran '07)
  – Factored MDPs
  – Polynomial in the size of the smaller MDP

Attention?

• How can this concept be used to model attention?
• Combine with hierarchical RL
  – Look at sub-task-specific relevance
  – Structured homomorphisms
  – Deixis (δεῖξις, to point)

Factored MDPs

• State and action spaces defined as products of features/variables
• Factored transition probabilities
• Exploit the structure to define simple transformations

[Figure: 2-slice temporal Bayes net over variables x₁, x₂, x₃ and reward r, with factors P(x₁′ | x₁, x₂), P(x₂′ | x₁, x₂), P(x₃′ | x₂, x₃), and P(r | x₁, x₂)]
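A factored transition model can be stored as one conditional probability table per next-state variable, keyed by its parents' current values, mirroring the 2-slice net above. The sketch below is illustrative; the variable names and probabilities are made up.

```python
# Illustrative sketch of a factored transition model: one conditional probability
# table (CPT) per next-state variable, keyed by its parents' current values.
import random

cpts = {
    "x1": {"parents": ("x1", "x2"),
           "table": {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
                     (1, 0): {0: 0.2, 1: 0.8}, (1, 1): {0: 0.1, 1: 0.9}}},
    # ... analogous CPTs for "x2" (parents x1, x2) and "x3" (parents x2, x3)
}

def sample_next(state, cpts):
    """Sample each next-state variable from its own CPT, given the parents' values."""
    next_state = {}
    for var, cpt in cpts.items():
        key = tuple(state[p] for p in cpt["parents"])
        dist = cpt["table"][key]
        next_state[var] = random.choices(list(dist), weights=list(dist.values()))[0]
    return next_state

# e.g. sample_next({"x1": 0, "x2": 1}, cpts)
```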

Using Factored Representations

• Represent symmetry information in terms of features
  – E.g., the NE-SW symmetry can be represented as
    ((x, y), N) ≡ ((y, x), E) and ((x, y), W) ≡ ((y, x), S)
• Simple forms of transformations
  – Projections
  – Permutations
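As a tiny illustration, under the same notational assumptions, this symmetry is just a swap of the two coordinates together with a relabelling of the actions:

```python
# Tiny illustration (same notational assumptions): the NE-SW symmetry as a swap of
# the coordinates plus a relabelling of the actions.
ACTION_MAP = {"N": "E", "E": "N", "W": "S", "S": "W"}

def reflect(state, action):
    """Map ((x, y), a) to its symmetric image ((y, x), mapped a)."""
    x, y = state
    return (y, x), ACTION_MAP[action]
```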

Hierarchical Reinforcement Learning

Options framework

Options (Sutton, Precup, and Singh, 1999): a generalization of actions to include temporally extended courses of action.

An option o is a triple ⟨I, π, β⟩ where:
• I ⊆ S is the set of states in which the option may be started
• π: Ψ → [0, 1] is the (stochastic) policy followed during o
• β: S → [0, 1] is the probability of terminating in each state

Example: robot docking
• π: pre-defined controller
• β: terminate when docked or when the charger is not visible
• I: all states in which the charger is in sight
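A minimal sketch of the option triple as a data structure; the callable signatures (deterministic policy, per-state termination probability) are simplifying assumptions. For the docking example, the termination function would return 1.0 when the robot is docked or the charger is out of sight.

```python
# Minimal sketch of the option triple <I, pi, beta>; the callable signatures
# (deterministic policy, termination probability per state) are simplifying assumptions.
from dataclasses import dataclass
from typing import Callable, Hashable, Set

@dataclass
class Option:
    initiation_set: Set[Hashable]             # I: states where the option may start
    policy: Callable[[Hashable], Hashable]    # pi: policy followed while the option runs
    termination: Callable[[Hashable], float]  # beta: probability of terminating in each state

    def can_start(self, s) -> bool:
        return s in self.initiation_set
```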

Sub-goal Options

• Gather all the red objects
• Five options, one for each room
• Sub-goal options
• Implicitly represent the option policy
• Option MDPs related to each other

Relativized Options

• Relativized options (Ravindran and Barto '03)
  – Spatial abstraction: MDP homomorphisms
  – Temporal abstraction: options framework
• Abstract representation of a related family of sub-tasks
  – Each sub-task can be derived by applying well-defined transformations

Relativized Options (cont.)

A relativized option is the tuple O = ⟨h, M_O, I, β⟩:
• h: option homomorphism
• M_O: option MDP (the image of h)
• I ⊆ S: initiation set
• β: S_O → [0, 1]: termination criterion

[Figure: at the top level, the environment percept is passed through h to obtain a reduced state of the option MDP, whose policy selects the option's actions]
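The relativized-option tuple admits an equally small sketch; the types below lean on the MDP and Option sketches given earlier and are assumptions, not the authors' implementation.

```python
# Sketch of the relativized-option tuple <h, M_O, I, beta>; the types lean on the
# MDP and Option sketches above and are assumptions, not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, Hashable, Set, Tuple

@dataclass
class RelativizedOption:
    h: Callable[[Hashable, Hashable], Tuple[Hashable, Hashable]]  # option homomorphism (s, a) -> (s_O, a_O)
    option_mdp: "MDP"                                             # M_O, the image of h
    initiation_set: Set[Hashable]                                 # I, a subset of S
    termination: Callable[[Hashable], float]                      # beta: S_O -> [0, 1]
```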

Rooms World Task

• Single relativized option: get-object-exit-room
• Especially useful when learning the option policy
  – Speed-up
  – Knowledge transfer
• Terminology: Iba '89
• Related to parameterized sub-tasks (Dietterich '00; Andre and Russell '01, '02)

Option Schema

• Finding the right transformation?
  – Given a set of candidate transformations
• The option MDP and its policy can be viewed as a policy schema (Schmidt '75)
  – Template of a policy
  – Acquire the schema in a prototypical setting
  – Learn bindings of sensory inputs and actions to the schema

Problem Formulation

• Given:
  – ⟨M_O, I, β⟩ of a relativized option
  – H, a family of transformations
• Identify the option homomorphism h
• Formulate as a parameter estimation problem
  – One parameter for each sub-task, taking values from H
  – Samples: s₁, a₁, s₂, a₂, …
  – Bayesian learning

Algorithm

• Assume a uniform prior p₀(h, s)
• Experience: s_n, a_n, s_{n+1}
• Update the posteriors:

  p_n(h, s) = P_O(f(s_n), g_{s_n}(a_n), f(s_{n+1})) · p_{n−1}(h, s) / (normalizing factor)

  where P(s_n, a_n, s_{n+1} | h, s) = P_O(f(s_n), g_{s_n}(a_n), f(s_{n+1}))
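A hedged sketch of this update: one weight per candidate transformation h in H (kept per sub-task), multiplied by the likelihood the option MDP assigns to the observed transition under h, then re-normalized. The helper prob_in_image stands in for P_O(f(s_n), g_{s_n}(a_n), f(s_{n+1})) and is assumed to be supplied by the option MDP; it is not defined here.

```python
# Hedged sketch of the Bayesian weight update described above.
def update_posterior(weights, transition, prob_in_image):
    """weights: dict h -> p_{n-1}(h); transition: (s_n, a_n, s_{n+1})."""
    s, a, s_next = transition
    new_w = {h: prob_in_image(h, s, a, s_next) * w for h, w in weights.items()}
    z = sum(new_w.values())                 # normalizing factor
    if z == 0.0:
        return dict(weights)                # degenerate evidence: keep the old posterior
    return {h: w / z for h, w in new_w.items()}

# Initialization with the uniform prior over the candidate set H:
#   weights = {h: 1.0 / len(H) for h in H}
```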

Complex Game World

• Symmetric option MDP
• One delayer
• 40 candidate transformations
  – 8 spatial transformations combined with 5 projections
• Parameters of the option MDP differ from those of the rooms-world task

Results – Speed of Convergence

• Learning the policy is more difficult than learning the correct transformation!

Results – Transformation Weights in Room 4

• The weight of transformation 12 eventually converges to 1

Results – Transformation Weights in Room 2

• The weights oscillate a lot
• Some transformation eventually dominates
  – Which one changes from run to run


Deictic Representation

• Making attention more explicit
• Sense the world via pointers: selective attention
• Actions defined with respect to pointers
• Agre '88
  – Game domain: Pengo
• Pointers can be arbitrarily complex
  – ice-cube-next-to-me
  – robot-chasing-me

[Figure: blocks-world example, "Move block … to top of block …"]

Deixis and Abstraction

• Deictic pointers project states and actions onto some abstract state-action space
• Consistent representation (Whitehead and Ballard '91)
  – States with the same abstract representation have the same optimal value
  – Lion algorithm; works with deterministic systems
• Extend relativized options to model deictic representation (Ravindran, Barto, and Mathew '07)
  – Factored MDPs
  – Restrict the available transformations: only projections
  – The homomorphism conditions ensure a consistent representation

Deictic Option Schema

• A deictic option schema is the tuple ⟨K, D, O⟩:
  – O: a relativized option
  – K: a set of deictic pointers
  – D: a collection of sets of possible projections, one for each pointer
• Finding the correct transformation for a given state gives a consistent representation
• Use a factored version of the parameter estimation algorithm
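A small sketch of the ⟨K, D, O⟩ tuple as a data structure; the field names are illustrative, and the pointer names echo the delayer and retriever pointers used later in the game domain.

```python
# Small sketch of the deictic option schema <K, D, O>; field names are illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class DeicticOptionSchema:
    option: "RelativizedOption"             # O: the relativized option
    pointers: List[str]                     # K: deictic pointers, e.g. ["delayer", "retriever"]
    projections: Dict[str, List[Callable]]  # D: candidate projections, one set per pointer
```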

Classes of Pointers

• Independent pointers
• Mutually dependent pointers
• Dependent pointers

[Figure: 2-slice DBN fragments over pointer variables x₁, x₂, x₃ illustrating the three classes of dependence]

Problem Formulation

• Given: ⟨K, D, O⟩ of a relativized option
• Identify the right pointer configuration for each sub-task
• Formulate as a parameter estimation problem
  – One parameter for each set of connected pointers per sub-task
  – Takes values from the corresponding sets in D
  – Samples: s₁, a₁, s₂, a₂, …
  – Heuristic modification of Bayesian learning

Heuristic Update Rule

• Use a heuristic update rule for each pointer component l:

  w_n^l(h, s) = P̃_O^l(f(s_n), g_{s_n}(a_n), f(s_{n+1})) · w_{n−1}^l(h, s) / (normalizing factor)

  where P̃_O^l(s, a, s′) = max(ν, P_O^l(s, a, s′)) and ν is a small positive constant
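A sketch of the clipped update, run separately for each set of connected pointers l, so that one unlucky transition cannot drive a weight to zero; the value of nu and the helper prob_in_image_l are illustrative stand-ins, as above.

```python
# Sketch of the clipped (heuristic) update for one pointer component l.
NU = 1e-3   # small positive constant (illustrative value)

def heuristic_update(weights_l, transition, prob_in_image_l, nu=NU):
    """weights_l: dict h -> w_{n-1}^l(h) for one pointer component l."""
    s, a, s_next = transition
    new_w = {h: max(nu, prob_in_image_l(h, s, a, s_next)) * w
             for h, w in weights_l.items()}
    z = sum(new_w.values())                 # normalizing factor
    return {h: w / z for h, w in new_w.items()}
```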

Game Domain

• 2 deictic pointers: delayer and retriever
• 2 fixed pointers: where-am-I and have-diamond
• 8 possible values each for the delayer and retriever
  – 64 transformations

Experimental Setup

• Composite agent
  – Uses all 64 transformations and a single-component weight vector
• Deictic agent
  – Uses a 2-component weight vector
  – 8 transformations per component
• Hierarchical SMDP Q-learning

Experimental Results – Speed of Convergence

Experimental Results – Timing

• Execution time
  – Composite: 14 hours 29 minutes
  – Factored: 4 hours 16 minutes

Experimental Results – Composite Weights
• Mean: 2006, Std. Dev.: 1673

Experimental Results – Delayer Weights
• Mean: 52, Std. Dev.: 28.21

Experimental Results – Retriever Weights
• Mean: 3045, Std. Dev.: 2332.42

Robosoccer

• Hard to learn a policy for the entire game
• Look at simpler problems
  – Keepaway
  – Half-field offence
• Learn the policy in a relative frame of reference
• Keep changing the bindings

Summary

• Richer representations are needed for RL
  – Deictic representations
• Deictic option schemas
  – Combine ideas from hierarchical RL and MDP homomorphisms
  – Capture aspects of deictic representation

Future

• Build an integrated cognitive architecture that uses both bottom-up and top-down information
  – In perception
  – In decision making
• Combine aspects of supervised, unsupervised, and reinforcement learning, and planning approaches
