Decision Making in Robots and Autonomous Agents
Introduction
Subramanian Ramamoorthy School of Informatics
15 January, 2013
What is a Robot/Autonomous Agent?
Environment
Perception
Action
Adversarial actions & other agents
High-level goals
Problem: How to generate actions to achieve high-level goals, using limited perception and incomplete knowledge of the environment and of adversarial actions?
15/01/2013 2
Robot Cars
• http://www.youtube.com/watch?gl=GB&v=1W27Q6YvTXc
Rescue Robots
• http://www.youtube.com/watch?v=F7lqriYKsX4
RoboCup
• http://www.youtube.com/watch?v=9HqVe4GHV9M
What Makes Robotics Problems Hard?
One more thing…
What Makes Robotics Problems Hard?
What Happens if You Plug in Real People?
Levels of Difficulty in Interaction
Consequences for hardness of learning:
1. Base case: spatial asymmetry
– Learn a vector field
2. Next level: deal with reactive behaviour
– ‘Inverse’ planning, plan recognition
3. Harder case: recursive exchange of beliefs (e.g., signaling, implicit coordination, trust, persuasion)
– Need to model as a game?
Difficulty of Interaction
http://www.youtube.com/watch?v=HacG_FWWPOw
Application Example: Task Allocation (Kiva)
[Source: boston.com]
Application Example: Persuasive Technology (Beeminder)
In this course…
We will focus on how to model and compute decisions (choices),
• over time, under uncertainty, with incompleteness in models
• emphasizing interactive settings
• including methods for learning from experience and data.
• Also, we care about capturing how real people make choices!

Themes:
1. Decision theory and sequential decision making models, e.g., MDP
2. Strategic games
3. Learning in games, e.g., to achieve equilibria
4. Mechanism design and learning in that setting
5. Behavioural issues in decision making
Course Structure
• Schedule of lectures is available at the course web site
http://www.inf.ed.ac.uk/teaching/courses/dmr/
I will attempt to upload slides the day before (except in the first week)
• Two homework assignments
– Pen-and-paper exercise on models, concepts, methods (20%)
– Practical programming exercise in a simple domain (20%)
• Final Exam (60% of final mark)
• Resources:
– No prescribed textbook
– Suggested readings (books) listed in course web site
– Additional readings from research literature to be uploaded
Ask Questions!
– During the lecture
– After class, if your questions are brief
– After hours, by prior appointment only (send me email)
• Be aware of Informatics Forum schedule for teaching activities (https://wiki.inf.ed.ac.uk/Vademecum/InformaticsForum)
Teaser: Secretary Problem
• Choose one secretary from n applicants
• Applicants interviewed sequentially in random order, each order being equally likely.
• Assume you can rank all applicants without ties. The decision to accept or reject must be based solely on the relative ranking of applicants interviewed so far. If you reach the final applicant, you are forced to hire that person.
• All decisions are instantaneous and final; an applicant already rejected can’t be reconsidered.
• Your decision criterion is to maximize the quality of the chosen candidate.
Solution to the Secretary Problem
• Observe (and reject) the first n/e applicants, then accept the first subsequent applicant who ranks above everyone seen so far
• Why? – Interesting reading: T. Ferguson, Who solved the secretary problem?, Statistical Science 4(3): 282–289, 1989.
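The 1/e rule is easy to check empirically. A minimal simulation sketch (illustrative Python, not from the lecture; n = 50 and the trial count are arbitrary choices) estimates the success probability, which should land near 1/e ≈ 0.37:

```python
import math
import random

def secretary_trial(n, rng):
    """One run: ranks 1..n (n = best) arrive in random order.
    Reject the first n/e applicants, then accept the first one
    better than everything seen so far; hire the last if forced."""
    order = list(range(1, n + 1))
    rng.shuffle(order)
    cutoff = int(n / math.e)
    best_seen = max(order[:cutoff], default=0)
    for rank in order[cutoff:]:
        if rank > best_seen:
            return rank == n          # accepted: success iff best overall
    return order[-1] == n             # forced to hire the final applicant

def success_rate(n=50, trials=20000, seed=0):
    """Monte Carlo estimate of the probability of hiring the best."""
    rng = random.Random(seed)
    return sum(secretary_trial(n, rng) for _ in range(trials)) / trials
```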
A Decision Making Agent
We will be interested in the problem of devising complete agents:
• Temporally situated
• Continual learning and planning
• Objective is to affect the environment – actions and states
• Environment: uncertain, time-varying, other strategic agents
[Figure: the agent-environment loop. The agent emits an action; the environment returns the next state and a reward.]
Simple Choices: Multi-arm Bandits (MAB)
• N possible actions
• You can play for some period of time and you want to maximize reward (expected utility)
Which is the best arm/machine?
DEMO
Real-Life Version
• Choose the best content to display to the next visitor of your commercial website
• Content options = slot machines
• Reward = user's response (e.g., click on an ad)
• Also, clinical trials: arm = treatment, reward = patient cured
• Simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems but that is for some later discussion.
What is the Choice?
n-Armed Bandit Problem
• Choose repeatedly from one of n actions; each choice is called a play
• After each play a_t, you get a reward r_t, where E[r_t | a_t] = Q*(a_t)
– these Q*(a) are the unknown action values
– the distribution of r_t depends only on a_t
Objective is to maximize the reward in the long term, e.g., over 1000 plays
To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them
Exploration/Exploitation Dilemma
• Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
• The greedy action at time t is a_t* = argmax_a Q_t(a)
– a_t = a_t* : exploitation
– a_t ≠ a_t* : exploration
• You can’t exploit all the time; you can’t explore all the time
• You can never stop exploring; but you could reduce exploring over time. Why?
Action-Value Learning
• Methods that adapt action-value estimates and nothing else. Suppose that by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}; then the “sample average” is

Q_t(a) = (r_1 + r_2 + … + r_{k_a}) / k_a

and lim_{k_a → ∞} Q_t(a) = Q*(a)
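A minimal sketch of sample-average estimation (illustrative Python, not from the lecture; the single arm’s true value q_star = 0.7 is a made-up number), showing Q_t(a) approaching Q*(a) as k_a grows:

```python
import random

def sample_average(rewards):
    """Q_t(a): the mean of the k_a rewards received so far for action a."""
    return sum(rewards) / len(rewards)

# Hypothetical single arm with true value q_star and N(q_star, 1) rewards:
rng = random.Random(1)
q_star = 0.7
estimate = sample_average([rng.gauss(q_star, 1.0) for _ in range(100000)])
```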
Remark
• The simple greedy action selection strategy: a_t = a_t* = argmax_a Q_t(a)
• Why might this be insufficient?
• You are estimating, online, from a few samples. How will this behave?
DEMO
ε-Greedy Action Selection
• Greedy action selection: a_t = a_t* = argmax_a Q_t(a)
• ε-Greedy:
a_t = a_t* with probability 1 − ε
a_t = a random action with probability ε
. . . the simplest way to balance exploration and exploitation
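The ε-greedy rule fits in a few lines; a sketch (illustrative Python, not the course’s code):

```python
import random

def epsilon_greedy(q_estimates, epsilon, rng):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick an action with maximal estimated value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])
```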
Worked Example: 10-Armed Testbed
• n = 10 possible actions
• Each Q*(a) is chosen randomly from a normal distribution: Q*(a) ~ N(0, 1)
• Each reward r_t is also normal: r_t ~ N(Q*(a_t), 1)
• 1000 plays; repeat the whole thing 2000 times and average the results
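The testbed is easy to reproduce. A small-scale sketch (illustrative Python; far fewer runs than the 2000 on the slide) that can be used to compare ε = 0.1 against pure greedy selection:

```python
import random

def run_bandit(n_arms=10, plays=1000, epsilon=0.1, seed=0):
    """One testbed run: true values Q*(a) ~ N(0, 1), rewards
    r_t ~ N(Q*(a_t), 1), sample-average estimates (Q_0 = 0),
    epsilon-greedy action selection. Returns mean reward per play."""
    rng = random.Random(seed)
    q_star = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
    q_est = [0.0] * n_arms
    counts = [0] * n_arms
    total = 0.0
    for _ in range(plays):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)                       # explore
        else:
            a = max(range(n_arms), key=lambda i: q_est[i])  # exploit
        r = rng.gauss(q_star[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental sample average
        total += r
    return total / plays
```

Averaged over many runs, ε = 0.1 should earn more than the purely greedy (ε = 0) agent, which tends to lock onto a suboptimal arm.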
ε-Greedy Methods on the 10-Armed Testbed
Softmax Action Selection
• Softmax action selection methods grade action probabilities by estimated values.
• The most common softmax uses a Gibbs, or Boltzmann, distribution:
Choose action a on play t with probability

e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}

where τ is a ‘computational temperature’
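The Gibbs/Boltzmann rule above can be sketched as (illustrative Python, not from the lecture):

```python
import math
import random

def softmax_probs(q_estimates, temperature):
    """Gibbs/Boltzmann distribution over actions:
    P(a) = exp(Q_t(a)/tau) / sum_b exp(Q_t(b)/tau)."""
    exps = [math.exp(q / temperature) for q in q_estimates]
    z = sum(exps)
    return [e / z for e in exps]

def softmax_action(q_estimates, temperature, rng):
    """Sample one action according to the softmax probabilities."""
    probs = softmax_probs(q_estimates, temperature)
    return rng.choices(range(len(q_estimates)), weights=probs, k=1)[0]
```

High τ makes the choice nearly uniform (pure exploration); as τ → 0 it approaches greedy selection.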
Incremental Implementation
The average of the first k rewards is (dropping the dependence on a):

Q_k = (r_1 + r_2 + … + r_k) / k

How to do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} − Q_k]

This has the general form: NewEstimate = OldEstimate + StepSize [Target − OldEstimate]
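The incremental update with step size 1/(k+1) reproduces the sample average exactly; a quick sketch (illustrative Python):

```python
def incremental_mean(rewards):
    """Computes the sample average with the running update
    Q_{k+1} = Q_k + (1/(k+1)) * (r_{k+1} - Q_k), i.e.
    NewEstimate = OldEstimate + StepSize * (Target - OldEstimate),
    without storing the individual rewards."""
    q = 0.0
    for k, r in enumerate(rewards):
        q += (r - q) / (k + 1)
    return q
```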
Tracking a Nonstationary Problem
Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem.

Better in the nonstationary case is:

Q_{k+1} = Q_k + α [r_{k+1} − Q_k], for constant α, 0 < α ≤ 1

which gives

Q_k = (1 − α)^k Q_0 + Σ_{i=1}^{k} α (1 − α)^{k−i} r_i

an exponential, recency-weighted average
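A constant-α update is a one-liner, and its tracking behaviour is easy to check (illustrative Python; α = 0.1 and the reward sequences in the usage below are arbitrary):

```python
def constant_step_update(q, reward, alpha):
    """Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k) for constant alpha in (0, 1].
    Past rewards are discounted by (1 - alpha) per step, so this is an
    exponential, recency-weighted average that can track a drifting Q*(a)."""
    return q + alpha * (reward - q)
```

For example, if the reward level jumps from +1 to −1 mid-stream, the estimate follows the new level within a few tens of steps at α = 0.1, whereas a sample average would respond ever more sluggishly.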
Optimistic Initial Values
• All methods so far depend on Q_0(a), i.e., they are biased
• Encourage exploration: initialize the action values optimistically,
i.e., on the 10-armed testbed, use Q_0(a) = 5 for all a
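A sketch of why optimistic initialization forces exploration (illustrative Python; the run length and Q_0 = 5 mirror the testbed, everything else is an assumption): even with purely greedy selection, every arm initially looks better than it can really be, so each gets tried and “disappoints” before estimates settle.

```python
import random

def optimistic_greedy_run(n_arms=10, plays=200, q0=5.0, seed=0):
    """Pure greedy selection, but with optimistic initial estimates
    Q_0(a) = q0 on a testbed with Q*(a) ~ N(0, 1) and N(Q*(a), 1)
    rewards. Returns how often each arm was played."""
    rng = random.Random(seed)
    q_star = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
    q_est = [q0] * n_arms
    counts = [0] * n_arms
    for _ in range(plays):
        a = max(range(n_arms), key=lambda i: q_est[i])  # always greedy
        r = rng.gauss(q_star[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # sample-average update
    return counts
```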
Bandits are Everywhere!
Beyond the MAB Model
• In this lecture, we are in a single casino and the only decision is which of n arms to pull – except perhaps in the very last slides, there is exactly one state!
Some questions to ponder before upcoming lectures,
• What if there is more than one state?
• So, in this state space, what is the effect of the distribution of payout changing based on how you pull arms?
• What happens if the other side is a real human bandit who reacts to your play?
Acknowledgements
• Some slides are adapted from web resources associated with Sutton and Barto’s Reinforcement Learning book