Reinforcement Learning
Lecture 1: Introduction
Vien Ngo, MLR, University of Stuttgart
What is Reinforcement Learning?
– Reinforcement Learning is a subfield of Machine Learning.

adapted from David Silver’s lecture
RL: A subfield of Machine Learning
(from Machine Learning course, 2011, Marc Toussaint)

• Supervised learning: learn from “labelled” data $\{(x_i, y_i)\}_{i=1}^N$
• Unsupervised learning: learn from “unlabelled” data $\{x_i\}_{i=1}^N$ only
• Semi-supervised learning: many unlabelled data, few labelled data
• Reinforcement learning: learn from data $\{(s_t, a_t, r_t, s_{t+1})\}$
  – learn a predictive model $(s, a) \mapsto s'$ (see the sketch below)
  – learn to predict reward $(s, a) \mapsto r$
  – learn a behavior $s \mapsto a$ that maximizes the expected total reward
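To make the first sub-problem concrete: learning a predictive model $(s, a) \mapsto s'$ from transition data is, in the simplest case, ordinary regression. A minimal Python sketch, not from the slides; the linear system (coefficients 0.9 and 0.5) and all variable names are invented for illustration:

# Model learning as regression (illustrative assumption: the true
# dynamics are linear, s' = 0.9*s + 0.5*a + noise).
import numpy as np

rng = np.random.default_rng(0)
N = 1000
s = rng.normal(size=N)                      # sampled states s_t
a = rng.normal(size=N)                      # sampled actions a_t
s_next = 0.9 * s + 0.5 * a + 0.01 * rng.normal(size=N)

X = np.stack([s, a], axis=1)                # regression inputs (s_t, a_t)
theta, *_ = np.linalg.lstsq(X, s_next, rcond=None)
print(theta)                                # recovers approx. [0.9, 0.5]

The same recipe with a richer function class (neural network, Gaussian process) gives the learned models used in model-based RL.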
What is Reinforcement Learning?
– RL is learning from interaction.
– There is no supervisor, only signals of reward/evaluative feedback.
– Decisions in sequence do matter, as they affect the outcomes of subsequent decisions.
from Satinder Singh’s Introduction to RL
What is Reinforcement Learning?
[figure slide, from Satinder Singh’s Introduction to RL]
Success of Reinforcement Learning

• Games
  – Backgammon (Tesauro, 1994)
  – Solitaire (X. Yan et al., 2005)
  – Chess
  – Checkers
  – Deep RL playing Atari games (2014, Google DeepMind)

• Operations Research
  – Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis, 1996)
  – Dynamic Channel Allocation (e.g. Singh & Bertsekas, 1997)
  – Vehicle Routing, etc.

• Economics
  – Trading

• Robotics
  – RoboCup Soccer (e.g. Stone & Veloso, 1999)
  – Helicopter Control (e.g. Ng, 2003; Abbeel & Ng, 2006)
  – Many robots (navigation, bipedal walking, grasping, switching between skills, ...)
TD-Gammon, by Gerald Tesauro
(See Section 11.1 in Sutton & Barto’s book.)

• See (Tesauro, 1992, 1994, 1995).
• The only reward is given at the end of the game, for a win.
• Self-play: use the current policy to sample moves on both sides!
• After about 300,000 games against itself, it played near the level of the world’s strongest grandmasters.
Go using UCT, by Gelly
(See Gelly et al., 2012, Communications of the ACM, for a review.)
Reinforcement Learning in Robotics

– Learning motor skills (2000, by Schaal, Atkeson, Vijayakumar)
– Autonomous Helicopter Flight (2007, Andrew Ng et al.)
– Playing Atari games (2014, Google DeepMind)
Reinforcement learning in neuroscience
(Yael Niv, ICML 2009 tutorial.)
Reinforcement learning in neuroscience
(Peter Dayan and Yael Niv, Neurobiology 2008.)

• The brain employs both model-free and model-based decision-making strategies in parallel, with each dominating in different circumstances.
What is Reinforcement Learning?

$s_1\, a_1\, r_2\; s_2\, a_2\, r_3\; \cdots\; s_i\, a_i\, r_{i+1}\, s_{i+1}\, \cdots$

• States can be vectors or other structures, defined as sufficient statistics to predict what happens next.
• Actions/controls can be multi-dimensional.
• Rewards are scalar but can be arbitrarily uninformative, and might be delayed; e.g., $r_t$ tells how well the agent does at time $t$ (after taking action $a_t$ at $s_t$).
• Objective: described as the maximization of the expected total reward.
• States are sometimes not directly observable:

$o_1\, a_1\, r_2\; o_2\, a_2\, r_3\; \cdots\; o_i\, a_i\, r_{i+1}\, o_{i+1}\, \cdots$

• The agent has only partial knowledge about the environment, e.g. unknown dynamics, reward, or observation functions (a minimal interaction-loop sketch follows below).
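The stream above is produced by a sequential interaction loop: the agent observes $s_t$, picks $a_t$, and the environment returns $r_{t+1}$ and $s_{t+1}$. A minimal Python sketch of that loop, not from the slides; the 2-state chain dynamics and the random behavior are invented for illustration:

import random

def step(state, action):
    # Toy dynamics: action -1/+1 moves between states 0 and 1;
    # landing in state 1 pays reward 1, otherwise 0.
    next_state = max(0, min(1, state + action))
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

state = 0
data = []                                # collects (s_t, a_t, r_{t+1}, s_{t+1})
for t in range(5):
    action = random.choice([-1, +1])     # a (here random) behavior s -> a
    next_state, reward = step(state, action)
    data.append((state, action, reward, next_state))
    state = next_state
print(data)

Every RL method in this course consumes exactly this kind of data; the methods differ in what they learn from it (a model, a value function, or a policy).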
What is Reinforcement Learning?
– Examples of rewards:

• +1/−1 for winning/losing a game, e.g. Go, Backgammon, ...
• +/− for increasing/decreasing the score, e.g. in deep RL algorithms playing Atari games.
• +/− rewards for earning/losing money in managing an investment portfolio.
• +/− rewards for following the desired trajectory / for crashing, in controlling a stunt helicopter.
• etc.
Components of an RL Agent
– Policy: defines the behavior of the agent, e.g. a mapping $\pi : S \mapsto A$ or $\pi : S \times A \mapsto [0, 1]$.
– Value function: the expected return when starting from a given state,

  $V^\pi(s) = \mathbb{E}_\pi\big[\textstyle\sum_t \gamma^t R_t \,\big|\, s_0 = s\big]$

– Model: the agent’s internal representation of the environment, e.g. $P(s'|s, a)$, $R(s, a, s')$ (a small policy-evaluation sketch using these follows below).
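To connect these components, a minimal Python sketch, not from the slides: it computes $V^\pi$ for an invented 2-state chain, given a model $P$, $R$ and a uniform random policy $\pi$, by repeatedly applying the Bellman expectation backup $V(s) \leftarrow \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\,[R(s,a,s') + \gamma V(s')]$:

import numpy as np

gamma = 0.9
S, A = 2, 2
# Assumed toy dynamics: action 0 moves/stays left, action 1 moves/stays right.
P = np.zeros((S, A, S))                  # P[s, a, s'] transition probabilities
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 1] = 1.0
R = np.zeros((S, A, S))
R[:, :, 1] = 1.0                         # reward 1 for landing in state 1
pi = np.full((S, A), 0.5)                # uniform random policy pi(a|s)

V = np.zeros(S)
for _ in range(500):                     # sweep the Bellman expectation backup
    V = np.array([
        sum(pi[s, a] * P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
            for a in range(A) for s2 in range(S))
        for s in range(S)
    ])
print(V)                                 # converges to approx. [5.0, 5.0] here

Each sweep contracts the error by a factor $\gamma$, which is why the iteration converges; this is exactly the policy-evaluation step inside Policy Iteration (Part 1 of the course).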
Schedule of this course

• Part 1: The Basics
  – Markov Decision Processes (MDP), Partially Observable MDPs (POMDP).
  – Dynamic Programming: Value Iteration, Policy Iteration.

• Part 2: Reinforcement Learning Topics
  – Temporal Difference learning, Q-Learning.
  – Reinforcement learning with function approximation.
  – Policy search.

• Part 3: Advanced Topics
  – Inverse reinforcement learning, imitation learning.
  – Exploration vs. exploitation: multi-armed bandits, PAC-MDP, Bayesian reinforcement learning.
  – Hierarchical reinforcement learning: macro actions, skill acquisition.
  – Deep reinforcement learning.
  – Reinforcement learning in POMDP environments.
Schedule of this course

• Missing:
  – Relational MDPs
  – MDP/POMDP/RL as inference
Literature

Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts / London, England, 1998.
http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html

Csaba Szepesvári: Algorithms for Reinforcement Learning. Morgan & Claypool, July 2010.
http://www.ualberta.ca/~szepesva/RLBook.html
Organisation

• Course webpage:
  https://ipvs.informatik.uni-stuttgart.de/mlr/reinforcement-learning-ss15/
  – Slides, exercises
  – Links to other resources

• Secretary, admin issues:
  Carola Stahl, [email protected], Room 2.217

• Lecture: Tue. 09:45-11:15, Room 0.124
• Tutorial: Wed. 14:00-15:30, Room 0.108

• Rules for the tutorials:
  – Doing the exercises is crucial!
  – At the beginning of each tutorial:
    – sign into a list
    – mark which exercises you have (successfully) worked on
  – Students are randomly selected to present their solutions.
  – You need 50% of completed exercises to be allowed to the exam.

(Prof. Marc Toussaint’s rules.)