Decision Making in Robots and Autonomous Agents
Introduction
Subramanian Ramamoorthy School of Informatics
15 January, 2013
What is a Robot/Autonomous Agent?
Environment
Perception
Action
Adversarial actions & other agents
High-level goals
Problem: How to generate actions to achieve high-level goals, using limited perception and incomplete knowledge of the environment and of adversarial actions?
15/01/2013 2
Robot Cars
• http://www.youtube.com/watch?gl=GB&v=1W27Q6YvTXc
Rescue Robots
• http://www.youtube.com/watch?v=F7lqriYKsX4
RoboCup
• http://www.youtube.com/watch?v=9HqVe4GHV9M
What Makes Robotics Problems Hard?
One more thing…
What Makes Robotics Problems Hard?
What Happens if You Plug in Real People?
Levels of Difficulty in Interaction
Consequences for hardness of learning:
1. Base case: spatial asymmetry
– Learn a vector field
2. Next level: deal with reactive behaviour
– ‘Inverse’ planning, plan recognition
3. Harder case: recursive exchange of beliefs (e.g., signaling, implicit coordination, trust, persuasion)
– Need to model as a game?
Difficulty of Interaction
http://www.youtube.com/watch?v=HacG_FWWPOw
Application Example: Task Allocation (Kiva)
[Source: boston.com]
Application Example: Persuasive Technology (Beeminder)
In this course…
We will focus on how to model and compute decisions (choices),
• over time, under uncertainty, with incompleteness in models
• emphasizing interactive settings
• including methods for learning from experience and data.
• Also, we care about capturing how real people make choices!

Themes:
1. Decision theory and sequential decision making models, e.g., MDP
2. Strategic games
3. Learning in games, e.g., to achieve equilibria
4. Mechanism design and learning in that setting
5. Behavioural issues in decision making
Course Structure
• Schedule of lectures is available at the course web site
http://www.inf.ed.ac.uk/teaching/courses/dmr/
I will attempt to upload slides the day before (except in the first week)
• Two homework assignments
– Pen-and-paper exercise on models, concepts, methods (20%)
– Practical programming exercise in a simple domain (20%)
• Final Exam (60% of final mark)
• Resources:
– No prescribed textbook
– Suggested readings (books) listed in course web site
– Additional readings from research literature to be uploaded
Ask Questions!
– During the lecture
– After class, if your questions are brief
– After hours, by prior appointment only (send me email)
• Be aware of Informatics Forum schedule for teaching activities (https://wiki.inf.ed.ac.uk/Vademecum/InformaticsForum)
Teaser: Secretary Problem
• Choose one secretary from n applicants
• Applicants interviewed sequentially in random order, each order being equally likely.
• Assume you can rank all applicants without ties. The decision to accept or reject must be based solely on the relative ranking of applicants interviewed so far. If you reach the final applicant, you are forced to hire that person.
• All decisions are instantaneous and final; an applicant already rejected can’t be reconsidered.
• Your decision criterion is to maximize the quality of the chosen candidate.
Solution to the Secretary Problem
• Observe (and reject) the first n/e applicants, then accept the first subsequent applicant who ranks above everyone seen so far
• Why? – Interesting reading: T. Ferguson, Who solved the secretary problem?, Statistical Science 4(3): 282–289, 1989.
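The 1/e rule is easy to check empirically. A minimal simulation sketch (illustrative Python, not from the lecture; n = 50 and the trial count are arbitrary choices) estimates the success probability, which should land near 1/e ≈ 0.37:

```python
import math
import random

def secretary_trial(n, rng):
    """One run: ranks 1..n (n = best) arrive in random order.
    Reject the first n/e applicants, then accept the first one
    better than everything seen so far; hire the last if forced."""
    order = list(range(1, n + 1))
    rng.shuffle(order)
    cutoff = int(n / math.e)
    best_seen = max(order[:cutoff], default=0)
    for rank in order[cutoff:]:
        if rank > best_seen:
            return rank == n          # accepted: success iff best overall
    return order[-1] == n             # forced to hire the final applicant

def success_rate(n=50, trials=20000, seed=0):
    """Monte Carlo estimate of the probability of hiring the best."""
    rng = random.Random(seed)
    return sum(secretary_trial(n, rng) for _ in range(trials)) / trials
```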
A Decision Making Agent
We will be interested in the problem of devising complete agents:
• Temporally situated
• Continual learning and planning
• Objective is to affect the environment – actions and states
• Environment: uncertain, time-varying, other strategic agents
[Figure: the agent-environment loop. The agent emits an action; the environment returns the next state and a reward.]
Simple Choices: Multi-arm Bandits (MAB)
• N possible actions
• You can play for some period of time and you want to maximize reward (expected utility)
Which is the best arm/machine?
DEMO
Real-Life Version
• Choose the best content to display to the next visitor of your commercial website
• Content options = slot machines
• Reward = user's response (e.g., click on an ad)
• Also, clinical trials: arm = treatment, reward = patient cured
• Simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems but that is for some later discussion.
What is the Choice?
n-Armed Bandit Problem
• Choose repeatedly from one of n actions; each choice is called a play
• After each play a_t, you get a reward r_t, where E[r_t | a_t] = Q*(a_t)
– these Q*(a) are the unknown action values
– the distribution of r_t depends only on a_t
Objective is to maximize the reward in the long term, e.g., over 1000 plays
To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them
Exploration/Exploitation Dilemma
• Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
• The greedy action at time t is a_t* = argmax_a Q_t(a)
– a_t = a_t* : exploitation
– a_t ≠ a_t* : exploration
• You can’t exploit all the time; you can’t explore all the time
• You can never stop exploring; but you could reduce exploring over time. Why?
Action-Value Learning
• Methods that adapt action-value estimates and nothing else. Suppose that by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}; then the “sample average” is

Q_t(a) = (r_1 + r_2 + … + r_{k_a}) / k_a

and lim_{k_a → ∞} Q_t(a) = Q*(a)
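A minimal sketch of sample-average estimation (illustrative Python, not from the lecture; the single arm’s true value q_star = 0.7 is a made-up number), showing Q_t(a) approaching Q*(a) as k_a grows:

```python
import random

def sample_average(rewards):
    """Q_t(a): the mean of the k_a rewards received so far for action a."""
    return sum(rewards) / len(rewards)

# Hypothetical single arm with true value q_star and N(q_star, 1) rewards:
rng = random.Random(1)
q_star = 0.7
estimate = sample_average([rng.gauss(q_star, 1.0) for _ in range(100000)])
```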
Remark
• The simple greedy action selection strategy: a_t = a_t* = argmax_a Q_t(a)
• Why might this be insufficient?
• You are estimating, online, from a few samples. How will this behave?
DEMO
ε-Greedy Action Selection
• Greedy action selection: a_t = a_t* = argmax_a Q_t(a)
• ε-Greedy:
a_t = a_t* with probability 1 − ε
a_t = a random action with probability ε
. . . the simplest way to balance exploration and exploitation
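The ε-greedy rule fits in a few lines; a sketch (illustrative Python, not the course’s code):

```python
import random

def epsilon_greedy(q_estimates, epsilon, rng):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick an action with maximal estimated value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])
```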
Worked Example: 10-Armed Testbed
• n = 10 possible actions
• Each Q*(a) is chosen randomly from a normal distribution: Q*(a) ~ N(0, 1)
• Each reward r_t is also normal: r_t ~ N(Q*(a_t), 1)
• 1000 plays; repeat the whole thing 2000 times and average the results
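The testbed is easy to reproduce. A small-scale sketch (illustrative Python; far fewer runs than the 2000 on the slide) that can be used to compare ε = 0.1 against pure greedy selection:

```python
import random

def run_bandit(n_arms=10, plays=1000, epsilon=0.1, seed=0):
    """One testbed run: true values Q*(a) ~ N(0, 1), rewards
    r_t ~ N(Q*(a_t), 1), sample-average estimates (Q_0 = 0),
    epsilon-greedy action selection. Returns mean reward per play."""
    rng = random.Random(seed)
    q_star = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
    q_est = [0.0] * n_arms
    counts = [0] * n_arms
    total = 0.0
    for _ in range(plays):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)                       # explore
        else:
            a = max(range(n_arms), key=lambda i: q_est[i])  # exploit
        r = rng.gauss(q_star[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental sample average
        total += r
    return total / plays
```

Averaged over many runs, ε = 0.1 should earn more than the purely greedy (ε = 0) agent, which tends to lock onto a suboptimal arm.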
ε-Greedy Methods on the 10-Armed Testbed
Softmax Action Selection
• Softmax action selection methods grade action probabilities by estimated values.
• The most common softmax uses a Gibbs, or Boltzmann, distribution:
Choose action a on play t with probability

e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}

where τ is a ‘computational temperature’
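The Gibbs/Boltzmann rule above can be sketched as (illustrative Python, not from the lecture):

```python
import math
import random

def softmax_probs(q_estimates, temperature):
    """Gibbs/Boltzmann distribution over actions:
    P(a) = exp(Q_t(a)/tau) / sum_b exp(Q_t(b)/tau)."""
    exps = [math.exp(q / temperature) for q in q_estimates]
    z = sum(exps)
    return [e / z for e in exps]

def softmax_action(q_estimates, temperature, rng):
    """Sample one action according to the softmax probabilities."""
    probs = softmax_probs(q_estimates, temperature)
    return rng.choices(range(len(q_estimates)), weights=probs, k=1)[0]
```

High τ makes the choice nearly uniform (pure exploration); as τ → 0 it approaches greedy selection.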
Incremental Implementation
The average of the first k rewards is (dropping the dependence on a):

Q_k = (r_1 + r_2 + … + r_k) / k

How to do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} − Q_k]

This has the general form: NewEstimate = OldEstimate + StepSize [Target − OldEstimate]
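The incremental update with step size 1/(k+1) reproduces the sample average exactly; a quick sketch (illustrative Python):

```python
def incremental_mean(rewards):
    """Computes the sample average with the running update
    Q_{k+1} = Q_k + (1/(k+1)) * (r_{k+1} - Q_k), i.e.
    NewEstimate = OldEstimate + StepSize * (Target - OldEstimate),
    without storing the individual rewards."""
    q = 0.0
    for k, r in enumerate(rewards):
        q += (r - q) / (k + 1)
    return q
```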
Tracking a Nonstationary Problem
Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem.

Better in the nonstationary case is:

Q_{k+1} = Q_k + α [r_{k+1} − Q_k], for constant α, 0 < α ≤ 1

which gives

Q_k = (1 − α)^k Q_0 + Σ_{i=1}^{k} α (1 − α)^{k−i} r_i

an exponential, recency-weighted average
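A constant-α update is a one-liner, and its tracking behaviour is easy to check (illustrative Python; α = 0.1 and the reward sequences in the usage below are arbitrary):

```python
def constant_step_update(q, reward, alpha):
    """Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k) for constant alpha in (0, 1].
    Past rewards are discounted by (1 - alpha) per step, so this is an
    exponential, recency-weighted average that can track a drifting Q*(a)."""
    return q + alpha * (reward - q)
```

For example, if the reward level jumps from +1 to −1 mid-stream, the estimate follows the new level within a few tens of steps at α = 0.1, whereas a sample average would respond ever more sluggishly.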
Optimistic Initial Values
• All methods so far depend on Q_0(a), i.e., they are biased
• Encourage exploration: initialize the action values optimistically,
i.e., on the 10-armed testbed, use Q_0(a) = 5 for all a
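A sketch of why optimistic initialization forces exploration (illustrative Python; the run length and Q_0 = 5 mirror the testbed, everything else is an assumption): even with purely greedy selection, every arm initially looks better than it can really be, so each gets tried and “disappoints” before estimates settle.

```python
import random

def optimistic_greedy_run(n_arms=10, plays=200, q0=5.0, seed=0):
    """Pure greedy selection, but with optimistic initial estimates
    Q_0(a) = q0 on a testbed with Q*(a) ~ N(0, 1) and N(Q*(a), 1)
    rewards. Returns how often each arm was played."""
    rng = random.Random(seed)
    q_star = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
    q_est = [q0] * n_arms
    counts = [0] * n_arms
    for _ in range(plays):
        a = max(range(n_arms), key=lambda i: q_est[i])  # always greedy
        r = rng.gauss(q_star[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # sample-average update
    return counts
```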
Bandits are Everywhere!
Beyond the MAB Model
• In this lecture, we are in a single casino and the only decision is which of n arms to pull – except perhaps in the very last slides, there is exactly one state!
Some questions to ponder before upcoming lectures,
• What if there is more than one state?
• So, in this state space, what is the effect of the distribution of payout changing based on how you pull arms?
• What happens if the other side is a real human bandit who reacts to your play?
Acknowledgements
• Some slides are adapted from web resources associated with Sutton and Barto’s Reinforcement Learning book