Reinforcement Learning
Lecture 1: Introduction
Vien Ngo, MLR, University of Stuttgart
What is Reinforcement Learning?
– Reinforcement Learning is a subfield of Machine Learning.

adapted from David Silver’s lecture
RL: A subfield of Machine Learning
(from Machine Learning course, 2011, Marc Toussaint)

• Supervised learning: learn from “labelled” data $\{(x_i, y_i)\}_{i=1}^N$
• Unsupervised learning: learn from “unlabelled” data $\{x_i\}_{i=1}^N$ only
• Semi-supervised learning: many unlabelled data, few labelled data
• Reinforcement learning: learn from data $\{(s_t, a_t, r_t, s_{t+1})\}$
  – learn a predictive model $(s, a) \mapsto s'$ (see the sketch below)
  – learn to predict reward $(s, a) \mapsto r$
  – learn a behavior $s \mapsto a$ that maximizes the expected total reward
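To make the first sub-problem concrete: learning a predictive model $(s, a) \mapsto s'$ from transition data is, in the simplest case, ordinary regression. A minimal Python sketch, not from the slides; the linear system (coefficients 0.9 and 0.5) and all variable names are invented for illustration:

# Model learning as regression (illustrative assumption: the true
# dynamics are linear, s' = 0.9*s + 0.5*a + noise).
import numpy as np

rng = np.random.default_rng(0)
N = 1000
s = rng.normal(size=N)                      # sampled states s_t
a = rng.normal(size=N)                      # sampled actions a_t
s_next = 0.9 * s + 0.5 * a + 0.01 * rng.normal(size=N)

X = np.stack([s, a], axis=1)                # regression inputs (s_t, a_t)
theta, *_ = np.linalg.lstsq(X, s_next, rcond=None)
print(theta)                                # recovers approx. [0.9, 0.5]

The same recipe with a richer function class (neural network, Gaussian process) gives the learned models used in model-based RL.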
What is Reinforcement Learning?
– RL is learning from interaction.
– There is no supervisor, only signals of reward/evaluative feedback.
– Decisions in sequence do matter, as they affect the outcomes of subsequent decisions.
from Satinder Singh’s Introduction to RL
What is Reinforcement Learning?
[figure slide, from Satinder Singh’s Introduction to RL]
Success of Reinforcement Learning

• Games
  – Backgammon (Tesauro, 1994)
  – Solitaire (X. Yan et al., 2005)
  – Chess
  – Checkers
  – Deep RL playing Atari games (2014, Google DeepMind)

• Operations Research
  – Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis, 1996)
  – Dynamic Channel Allocation (e.g. Singh & Bertsekas, 1997)
  – Vehicle Routing, etc.

• Economics
  – Trading

• Robotics
  – RoboCup Soccer (e.g. Stone & Veloso, 1999)
  – Helicopter Control (e.g. Ng, 2003; Abbeel & Ng, 2006)
  – Many robots (navigation, bipedal walking, grasping, switching between skills, ...)
TD-Gammon, by Gerald Tesauro
(See Section 11.1 in Sutton & Barto’s book.)

• See (Tesauro, 1992, 1994, 1995).
• The only reward is given at the end of the game, for a win.
• Self-play: use the current policy to sample moves on both sides!
• After about 300,000 games against itself, it played near the level of the world’s strongest grandmasters.
Go using UCT, by Gelly
(See Gelly et al., 2012, Communications of the ACM, for a review.)
Reinforcement Learning in Robotics

– Learning motor skills (2000, by Schaal, Atkeson, Vijayakumar)
– Autonomous Helicopter Flight (2007, Andrew Ng et al.)
– Playing Atari games (2014, Google DeepMind)
Reinforcement learning in neuroscience
(Yael Niv, ICML 2009 tutorial.)
Reinforcement learning in neuroscience
(Peter Dayan and Yael Niv, Neurobiology 2008.)

• The brain employs both model-free and model-based decision-making strategies in parallel, with each dominating in different circumstances.
What is Reinforcement Learning?

$s_1\, a_1\, r_2\; s_2\, a_2\, r_3\; \cdots\; s_i\, a_i\, r_{i+1}\, s_{i+1}\, \cdots$

• States can be vectors or other structures, defined as sufficient statistics to predict what happens next.
• Actions/controls can be multi-dimensional.
• Rewards are scalar but can be arbitrarily uninformative, and might be delayed; e.g., $r_t$ tells how well the agent does at time $t$ (after taking action $a_t$ at $s_t$).
• Objective: described as the maximization of the expected total reward.
• States are sometimes not directly observable:

$o_1\, a_1\, r_2\; o_2\, a_2\, r_3\; \cdots\; o_i\, a_i\, r_{i+1}\, o_{i+1}\, \cdots$

• The agent has only partial knowledge about the environment, e.g. unknown dynamics, reward, or observation functions (a minimal interaction-loop sketch follows below).
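The stream above is produced by a sequential interaction loop: the agent observes $s_t$, picks $a_t$, and the environment returns $r_{t+1}$ and $s_{t+1}$. A minimal Python sketch of that loop, not from the slides; the 2-state chain dynamics and the random behavior are invented for illustration:

import random

def step(state, action):
    # Toy dynamics: action -1/+1 moves between states 0 and 1;
    # landing in state 1 pays reward 1, otherwise 0.
    next_state = max(0, min(1, state + action))
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

state = 0
data = []                                # collects (s_t, a_t, r_{t+1}, s_{t+1})
for t in range(5):
    action = random.choice([-1, +1])     # a (here random) behavior s -> a
    next_state, reward = step(state, action)
    data.append((state, action, reward, next_state))
    state = next_state
print(data)

Every RL method in this course consumes exactly this kind of data; the methods differ in what they learn from it (a model, a value function, or a policy).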
What is Reinforcement Learning?
– Examples of rewards:

• +1/−1 for winning/losing a game, e.g. Go, Backgammon, ...
• +/− for increasing/decreasing the score, e.g. in deep RL algorithms playing Atari games.
• +/− rewards for earning/losing money in managing an investment portfolio.
• +/− rewards for following the desired trajectory / for crashing, in controlling a stunt helicopter.
• etc.
Components of an RL Agent
– Policy: defines the behavior of the agent, e.g. a mapping $\pi : S \mapsto A$ or $\pi : S \times A \mapsto [0, 1]$.
– Value function: the expected return when starting from a given state,

  $V^\pi(s) = \mathbb{E}_\pi\big[\textstyle\sum_t \gamma^t R_t \,\big|\, s_0 = s\big]$

– Model: the agent’s internal representation of the environment, e.g. $P(s'|s, a)$, $R(s, a, s')$ (a small policy-evaluation sketch using these follows below).
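To connect these components, a minimal Python sketch, not from the slides: it computes $V^\pi$ for an invented 2-state chain, given a model $P$, $R$ and a uniform random policy $\pi$, by repeatedly applying the Bellman expectation backup $V(s) \leftarrow \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\,[R(s,a,s') + \gamma V(s')]$:

import numpy as np

gamma = 0.9
S, A = 2, 2
# Assumed toy dynamics: action 0 moves/stays left, action 1 moves/stays right.
P = np.zeros((S, A, S))                  # P[s, a, s'] transition probabilities
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 1] = 1.0
R = np.zeros((S, A, S))
R[:, :, 1] = 1.0                         # reward 1 for landing in state 1
pi = np.full((S, A), 0.5)                # uniform random policy pi(a|s)

V = np.zeros(S)
for _ in range(500):                     # sweep the Bellman expectation backup
    V = np.array([
        sum(pi[s, a] * P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
            for a in range(A) for s2 in range(S))
        for s in range(S)
    ])
print(V)                                 # converges to approx. [5.0, 5.0] here

Each sweep contracts the error by a factor $\gamma$, which is why the iteration converges; this is exactly the policy-evaluation step inside Policy Iteration (Part 1 of the course).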
Schedule of this course

• Part 1: The Basics
  – Markov Decision Processes (MDP), Partially Observable MDPs (POMDP).
  – Dynamic Programming: Value Iteration, Policy Iteration.

• Part 2: Reinforcement Learning Topics
  – Temporal Difference learning, Q-Learning.
  – Reinforcement learning with function approximation.
  – Policy search.

• Part 3: Advanced Topics
  – Inverse reinforcement learning, imitation learning.
  – Exploration vs. exploitation: multi-armed bandits, PAC-MDP, Bayesian reinforcement learning.
  – Hierarchical reinforcement learning: macro actions, skill acquisition.
  – Deep reinforcement learning.
  – Reinforcement learning in POMDP environments.
Schedule of this course

• Missing:
  – Relational MDPs
  – MDP/POMDP/RL as inference
Literature

Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts / London, England, 1998.
http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html

Csaba Szepesvári: Algorithms for Reinforcement Learning. Morgan & Claypool, July 2010.
http://www.ualberta.ca/~szepesva/RLBook.html
Organisation

• Course webpage:
  https://ipvs.informatik.uni-stuttgart.de/mlr/reinforcement-learning-ss15/
  – Slides, exercises
  – Links to other resources

• Secretary, admin issues:
  Carola Stahl, [email protected], Room 2.217

• Lecture: Tue. 09:45-11:15, Room 0.124
• Tutorial: Wed. 14:00-15:30, Room 0.108

• Rules for the tutorials:
  – Doing the exercises is crucial!
  – At the beginning of each tutorial:
    – sign into a list
    – mark which exercises you have (successfully) worked on
  – Students are randomly selected to present their solutions.
  – You need 50% of completed exercises to be allowed to the exam.

(Prof. Marc Toussaint’s rules.)