Decision Making in Robots and Autonomous Agents: Introduction
Subramanian Ramamoorthy, School of Informatics
15 January, 2013

Structure and Synthesis of Robot Motion Introduction · Application Example: Task Allocation (Kiva) 15/01/2013 11 [Source: boston.com] Application Example: Persuasive Technology (Beeminder)




What is a Robot/Autonomous Agent?

[Diagram labels: Environment; Perception; Action; Adversarial actions & other agents; High-level goals]

Problem: How to generate actions, to achieve high-level goals, using limited perception and incomplete knowledge of environment & adversarial actions?


Robot Cars

• http://www.youtube.com/watch?gl=GB&v=1W27Q6YvTXc


Rescue Robots

• http://www.youtube.com/watch?v=F7lqriYKsX4


RoboCup

• http://www.youtube.com/watch?v=9HqVe4GHV9M


What Makes Robotics Problems Hard?

One more thing…


What Makes Robotics Problems Hard?


What Happens if You Plug in Real People?


Levels of Difficulty in Interaction

Consequences for hardness of learning:

1. Base case: spatial asymmetry

– Learn a vector field

2. Next level: deal with reactive behaviour

– ‘Inverse’ planning, plan recognition

3. Harder case: recursive exchange of beliefs (e.g., signaling, implicit coordination, trust, persuasion)

– Need to model as a game?


Difficulty of Interaction

http://www.youtube.com/watch?v=HacG_FWWPOw


Application Example: Task Allocation (Kiva)


[Source: boston.com]


Application Example: Persuasive Technology (Beeminder)


In this course…

We will focus on how to model and compute decisions (choices),

• over time, under uncertainty, with incompleteness in models

• emphasizing interactive settings

• including methods for learning from experience and data.

• Also, we care about capturing how real people make choices!

Themes:

1. Decision theory and sequential decision making models, e.g., MDP

2. Strategic games

3. Learning in games, e.g., to achieve equilibria

4. Mechanism design and learning in that setting

5. Behavioural issues in decision making


Course Structure

• Schedule of lectures is available at the course web site

http://www.inf.ed.ac.uk/teaching/courses/dmr/

I will attempt to upload slides the day before (except in first week)

• Two homework assignments

– Pen-and-paper exercise on models, concepts, methods (20%)

– Practical programming exercise in a simple domain (20%)

• Final Exam (60% of final mark)

• Resources:

– No prescribed textbook

– Suggested readings (books) listed in course web site

– Additional readings from research literature to be uploaded


Ask Questions!

– During the lecture

– After class, if your questions are brief

– After hours, by prior appointment only (send me email)

• Be aware of Informatics Forum schedule for teaching activities (https://wiki.inf.ed.ac.uk/Vademecum/InformaticsForum)


Teaser: Secretary Problem

• Choose one secretary from n applicants

• Applicants interviewed sequentially in random order, each order being equally likely.

• Assume you can rank all applicants without ties. The decision to accept or reject must be based solely on the relative ranking of applicants interviewed so far. If you reach the final applicant, you are forced to hire that person.

• All decisions are instantaneous and final - an applicant already rejected can’t be reconsidered.

• Your decision criterion is to maximize the quality of the chosen candidate.


Solution to the Secretary Problem

• Reject the first n/e candidates (about 37% of them), then pick the first candidate after that who is better than everyone seen so far

• Why? Interesting reading: T. Ferguson, Who solved the secretary problem?, Statistical Science 4(3): 282-289, 1989.
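The cutoff rule can be checked empirically. The following is a minimal simulation sketch, not taken from the slides: it assumes the classical success criterion (probability of hiring the single best applicant), and the names `secretary_trial` and `success_rate` as well as the defaults for `n`, `trials` and `seed` are illustrative choices.

```python
import math
import random

def secretary_trial(n, rng):
    """Run one trial of the cutoff rule; return True if the best applicant is hired."""
    ranks = list(range(n))       # larger value = better applicant; n - 1 is the best
    rng.shuffle(ranks)           # random interview order, each order equally likely
    cutoff = int(n / math.e)     # applicants observed and rejected outright
    best_seen = max(ranks[:cutoff], default=-1)
    for r in ranks[cutoff:]:
        if r > best_seen:        # first applicant better than everyone seen so far
            return r == n - 1
    return ranks[-1] == n - 1    # nobody qualified: forced to hire the final applicant

def success_rate(n=100, trials=20000, seed=0):
    """Fraction of trials in which the rule hires the best applicant."""
    rng = random.Random(seed)
    return sum(secretary_trial(n, rng) for _ in range(trials)) / trials
```

Calling `success_rate()` should give a value in the neighbourhood of 1/e, which is the known limiting success probability of this rule for large n.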


A Decision Making Agent

We will be interested in the problem of devising complete agents:

• Temporally situated

• Continual learning and planning

• Objective is to affect the environment – actions and states

• Environment: uncertain, time-varying, other strategic agents

[Diagram: the agent-environment loop. The Agent sends an action to the Environment; the Environment returns a state and a reward to the Agent.]

Simple Choices: Multi-arm Bandits (MAB)

• N possible actions

• You can play for some period of time and you want to maximize reward (expected utility)

Which is the best arm/machine?

DEMO


Real-Life Version

• Choose the best content to display to the next visitor of your commercial website

• Content options = slot machines

• Reward = user's response (e.g., click on an ad)

• Also, clinical trials: arm = treatment, reward = patient cured

• Simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems but that is for some later discussion.


What is the Choice?


n-Armed Bandit Problem

• Choose repeatedly from one of n actions; each choice is called a play

• After each play $a_t$, you get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$

These are unknown action values

Distribution of $r_t$ depends only on $a_t$

Objective is to maximize the reward in the long term, e.g., over 1000 plays

To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them
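To make the setup concrete, here is a small sketch of a bandit environment. It is an illustration only: the class name `GaussianBandit`, the choice of unit-variance Gaussian rewards, and the example action values are assumptions, not something fixed by the problem statement above.

```python
import random

class GaussianBandit:
    """n-armed bandit: playing arm a pays a reward drawn from N(q_star[a], 1)."""

    def __init__(self, q_star, seed=None):
        self.q_star = list(q_star)        # true action values Q*(a), hidden from the learner
        self.rng = random.Random(seed)

    def play(self, a):
        """Return the reward r_t for the play a_t = a; it depends only on a."""
        return self.rng.gauss(self.q_star[a], 1.0)

# Hypothetical three-armed example: repeated plays of arm 1 average out near Q*(1).
bandit = GaussianBandit([0.2, 1.0, -0.5], seed=1)
rewards = [bandit.play(1) for _ in range(5000)]
mean_reward = sum(rewards) / len(rewards)
```

The learner only ever sees the values returned by `play`; `q_star` is what it is trying to estimate.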


Exploration/Exploitation Dilemma

• Suppose you form action value estimates $Q_t(a) \approx Q^*(a)$

• The greedy action at time $t$ is $a_t^* = \arg\max_a Q_t(a)$

    $a_t = a_t^*$ ⇒ exploitation
    $a_t \ne a_t^*$ ⇒ exploration

• You can’t exploit all the time; you can’t explore all the time

• You can never stop exploring; but you could reduce exploring. Why?


Action-Value Learning

• Methods that adapt action-value estimates and nothing else, e.g.: suppose by the $t$-th play, action $a$ had been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$, then

$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$$ (“sample average”)

$$\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$$
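As a small illustration of the sample-average estimate, the sketch below computes $Q_t(a)$ from a log of (action, reward) pairs. The function name, the `history` format and the `default` value for actions that were never played are assumptions made for this example.

```python
def sample_average_estimates(history, n_actions, default=0.0):
    """Q_t(a) = mean of the rewards received on the k_a plays of action a."""
    totals = [0.0] * n_actions
    counts = [0] * n_actions            # k_a for each action
    for action, reward in history:
        totals[action] += reward
        counts[action] += 1
    # Actions never played have no data, so they fall back to the default value.
    return [totals[a] / counts[a] if counts[a] else default
            for a in range(n_actions)]

# Action 0 was played twice (rewards 1 and 3), action 1 once, action 2 never.
q = sample_average_estimates([(0, 1.0), (1, 0.0), (0, 3.0)], n_actions=3)
# q == [2.0, 0.0, 0.0]
```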


Remark

• The simple greedy action selection strategy: $a_t = a_t^* = \arg\max_a Q_t(a)$

• Why might the above be insufficient?

• You are estimating, online, from a few samples. How will this behave?

DEMO


ε-Greedy Action Selection

• Greedy action selection: $a_t = a_t^* = \arg\max_a Q_t(a)$

• ε-Greedy: $a_t = a_t^*$ with probability $1 - \varepsilon$, and a random action with probability $\varepsilon$

. . . the simplest way to balance exploration and exploitation
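A possible implementation of this selection rule is sketched below. The function name and the random tie-breaking among equally valued greedy actions are choices made for this example; note that the "random action" branch draws uniformly over all actions, so it can also return the greedy one.

```python
import random

def epsilon_greedy(q_estimates, epsilon, rng=random):
    """Greedy action with probability 1 - epsilon, otherwise a uniformly random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))       # explore
    best = max(q_estimates)
    # Exploit; break ties among the greedy actions at random.
    return rng.choice([a for a, q in enumerate(q_estimates) if q == best])

# With estimates [0.1, 0.9, 0.3] and epsilon = 0.1, action 1 should dominate.
rng = random.Random(0)
picks = [epsilon_greedy([0.1, 0.9, 0.3], 0.1, rng) for _ in range(10000)]
greedy_fraction = picks.count(1) / len(picks)
```

Setting `epsilon=0` recovers pure greedy selection.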


Worked Example: 10-Armed Testbed

• n = 10 possible actions

• Each $Q^*(a)$ is chosen randomly from a normal distribution: $N(0, 1)$

• Each $r_t$ is also normal: $N(Q^*(a_t), 1)$

• 1000 plays; repeat the whole thing 2000 times and average the results
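The testbed can be approximated in a few lines. This is a sketch under stated assumptions: ε-greedy selection with sample-average estimates initialised to zero, ties broken by lowest index, and a default of 200 runs (rather than the 2000 above) to keep the running time short. The name `run_testbed` is illustrative.

```python
import random

def run_testbed(epsilon, n_arms=10, plays=1000, runs=200, seed=0):
    """Average reward per play of epsilon-greedy with sample-average estimates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        q_star = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]   # Q*(a) ~ N(0, 1)
        q = [0.0] * n_arms          # estimates Q_t(a)
        counts = [0] * n_arms       # k_a, number of plays of each arm
        for _ in range(plays):
            if rng.random() < epsilon:
                a = rng.randrange(n_arms)                       # explore
            else:
                a = max(range(n_arms), key=q.__getitem__)       # exploit
            r = rng.gauss(q_star[a], 1.0)                       # r_t ~ N(Q*(a_t), 1)
            counts[a] += 1
            q[a] += (r - q[a]) / counts[a]                      # running sample average
            total += r
    return total / (runs * plays)
```

Comparing, for example, `run_testbed(0.0)` with `run_testbed(0.1)` shows how a small amount of exploration affects the average reward on this testbed.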


ε-Greedy Methods on the 10-Armed Testbed


Softmax Action Selection

• Softmax action selection methods grade action probabilities by estimated values.

• The most common softmax uses a Gibbs, or Boltzmann, distribution:

Choose action $a$ on play $t$ with probability

$$\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}},$$

where $\tau$ is a 'computational temperature'.
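The Boltzmann rule might be coded as follows. This is a sketch: the function names are illustrative, and subtracting the maximum estimate before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math
import random

def softmax_probabilities(q_estimates, tau):
    """Gibbs/Boltzmann probabilities: P(a) is proportional to exp(Q_t(a) / tau)."""
    m = max(q_estimates)                     # shift by the max to avoid overflow
    weights = [math.exp((q - m) / tau) for q in q_estimates]
    z = sum(weights)
    return [w / z for w in weights]

def softmax_action(q_estimates, tau, rng=random):
    """Sample one action index according to the softmax probabilities."""
    probs = softmax_probabilities(q_estimates, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

A high temperature makes the actions nearly equiprobable, while a temperature close to zero approaches greedy selection.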


Incremental Implementation

Sample average estimation method: the average of the first $k$ rewards is (dropping the dependence on $a$):

$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$$

How to do this incrementally (without storing all the rewards)?

We could keep a running sum and count, or, equivalently:

$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$

NewEstimate = OldEstimate + StepSize [Target – OldEstimate]
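The incremental update can be checked against the plain average with a short sketch. The function name `incremental_mean` and the example rewards are illustrative.

```python
def incremental_mean(rewards, q0=0.0):
    """Apply Q_{k+1} = Q_k + (r_{k+1} - Q_k) / (k + 1) without storing the rewards."""
    q = q0
    for k, r in enumerate(rewards):      # k rewards have been averaged before r arrives
        q += (r - q) / (k + 1)           # StepSize = 1 / (k + 1), Target = r
    return q

rewards = [1.0, 4.0, 2.5, 0.5]
incremental = incremental_mean(rewards)
direct = sum(rewards) / len(rewards)     # both equal 2.0 up to rounding
```

Because the first step size is 1, the initial value `q0` is overwritten by the first reward and does not influence the result.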


Tracking a Nonstationary Problem

Choosing $Q_k$ to be a sample average is appropriate in a stationary problem, i.e., when none of the $Q^*(a)$ change over time, but not in a nonstationary problem.

Better in the nonstationary case is:

$$Q_{k+1} = Q_k + \alpha\left[r_{k+1} - Q_k\right] \quad \text{for constant } \alpha,\ 0 < \alpha \le 1$$

which gives

$$Q_k = (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i$$

an exponential, recency-weighted average
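The equivalence between the recursive update and the weighted-sum form can be verified numerically. The sketch below uses illustrative function names and arbitrary example values for the rewards, `alpha` and `q0`.

```python
def constant_step_update(rewards, alpha, q0=0.0):
    """Recursive form: Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k)."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q

def recency_weighted_average(rewards, alpha, q0=0.0):
    """Closed form: (1 - alpha)^k * Q_0 + sum_i alpha * (1 - alpha)^(k - i) * r_i."""
    k = len(rewards)
    weighted = sum(alpha * (1 - alpha) ** (k - i) * r
                   for i, r in enumerate(rewards, start=1))
    return (1 - alpha) ** k * q0 + weighted

rewards = [1.0, 0.0, 2.0, 5.0]
recursive = constant_step_update(rewards, alpha=0.1, q0=3.0)
closed_form = recency_weighted_average(rewards, alpha=0.1, q0=3.0)
```

The weight on reward $r_i$ shrinks geometrically with its age $k - i$, which is why recent rewards dominate the estimate.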


Optimistic Initial Values

• All methods so far depend on $Q_0(a)$, i.e., they are biased

• Encourage exploration: initialize the action values optimistically, e.g., on the 10-armed testbed, use $Q_0(a) = 5$ for all $a$
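The effect of optimistic initialisation on a purely greedy learner can be seen in a short sketch. Several details here are assumptions for illustration: a constant step size `alpha = 0.1`, unit-variance Gaussian rewards, ties broken by lowest index, and the hypothetical true action values in `q_star`.

```python
import random

def arms_tried(q0, q_star, plays, alpha=0.1, seed=0):
    """Play greedily from initial estimates q0; return the set of arms tried."""
    rng = random.Random(seed)
    q = [q0] * len(q_star)
    tried = set()
    for _ in range(plays):
        a = max(range(len(q)), key=q.__getitem__)   # purely greedy, no epsilon
        r = rng.gauss(q_star[a], 1.0)
        q[a] += alpha * (r - q[a])                  # constant step-size update
        tried.add(a)
    return tried

q_star = [0.1 * a for a in range(10)]               # hypothetical true action values
optimistic = arms_tried(5.0, q_star, plays=10)      # Q_0(a) = 5
neutral = arms_tried(0.0, q_star, plays=10)         # Q_0(a) = 0
```

With $Q_0(a) = 5$ every observed reward is almost certainly "disappointing", so the played arm's estimate drops below the untried ones and the greedy learner moves on to the next arm; with $Q_0(a) = 0$ it can lock onto the first arm that happens to pay a positive reward.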


Bandits are Everywhere!


Beyond the MAB Model

• In this lecture, we are in a single casino and the only decision is to pull from a set of n arms – except perhaps in the very last slides, exactly one state!

Some questions to ponder before upcoming lectures,

• What if there is more than one state?

• So, in this state space, what is the effect of the distribution of payout changing based on how you pull arms?

• What happens if the other side is a real human bandit who reacts to your play?


Acknowledgements

• Some slides are adapted from web resources associated with Sutton and Barto’s Reinforcement Learning book
