
CS 3630

Lecture 24: A brief introduction to Deep Reinforcement Learning

This lecture borrows heavily from the Deep RL Boot Camp slides, in particular the slide decks by Pieter Abbeel, Rocky Duan, Vlad Mnih, Andrej Karpathy, and John Schulman

Topics

1. Recap: MDP and RL Methods
2. Deep Q-Learning
3. Policy Optimization

Motivation

• Deep learning success in perception
• MDP and RL frameworks
  • Well understood
  • Early successes (backgammon)
  • Not great on more complex problems
• Can deep learning make RL really work? Evidence points to yes!

Image from the DeepMimic paper (Peng et al., 2018)

Some RL Success Stories

Overview

This lecture will not go into depth.

It provides a rough overview and pointers to excellent starting points.

Lecture materials are all cribbed from the Deep RL Boot Camp, which took place in August 2017 in Berkeley.

All slides/lectures are online: https://sites.google.com/view/deep-rl-bootcamp/home

Recap: Markov Decision Processes (MDP)

* P(s’|s,a) == T(s,a,s’) from Seth’s lectures
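As a reminder of notation only (the definitions themselves are on the slide images, not reproduced here): an MDP is the tuple (S, A, T, R, γ), with states S, actions A, transition model T(s, a, s’) = P(s’|s, a), reward function R, and discount factor γ.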

A note on Rewards

• Three different ways to do the reward:
  • R(s): reward is a function of the state only (Seth’s lectures)
  • R(s, a): reward in a given state depends on the action you take
  • R(s, a, s’): reward also depends on the state you land in
    • Most general
    • Matches the Deep RL Boot Camp slides
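As a side note (not on the original slide): when the transition model is known, the more general reward form reduces to the simpler one by taking an expectation over outcomes, e.g.

    R(s, a) = Σ_{s’} P(s’|s, a) · R(s, a, s’)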

Image copyright Stevens 2003, used here for academic purposes

Recap: Policy and Value Function

• Policy:
  • 𝜋: S -> A
  • Optimal policy 𝜋∗

• Value function:
  • Satisfies the Bellman equations

If actions 𝑎(𝑠) = 𝜋∗(𝑠), i.e. actions are chosen according to the optimal policy 𝜋∗:

Optimal value 𝑉∗(𝑠)
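For reference (the original slide shows these only as an image), the Bellman equations in their standard form, using the R(s, a, s’) reward convention above:

    V^𝜋(s) = Σ_{s’} P(s’|s, 𝜋(s)) [ R(s, 𝜋(s), s’) + γ V^𝜋(s’) ]
    V∗(s)  = max_a Σ_{s’} P(s’|s, a) [ R(s, a, s’) + γ V∗(s’) ]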

• Exercise: check the Bellman equations with the given noise 0.2 (actions only work 80% of the time) and discount factor 0.9

Recap: Value Iteration

Image borrowed from the (useful!) NVIDIA blog post on RL: https://devblogs.nvidia.com/deep-learning-nutshell-reinforcement-learning/
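Since the value iteration pseudocode above appears only as an image, here is a minimal Python sketch of the tabular backup. The MDP interface (states, actions(s), P(s, a) returning (prob, s’) pairs, and R(s, a, s’)) is assumed for illustration, not taken from the slides:

    # Minimal value-iteration sketch (not from the original slides).
    # Assumes a small finite MDP given as:
    #   states      : iterable of states
    #   actions(s)  : actions available in state s
    #   P(s, a)     : list of (prob, s_next) pairs, i.e. P(s'|s,a)
    #   R(s, a, s2) : reward, matching the R(s, a, s') convention above
    def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                q_values = [
                    sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in P(s, a))
                    for a in actions(s)
                ]
                new_v = max(q_values) if q_values else 0.0
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:   # converged: the Bellman backup is a contraction
                return V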

Recap: Policy Iteration

Image borrowed from (useful!) blog post at https://mpatacchiola.github.io/blog/2016/12/09/dissecting-reinforcement-learning.html
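Again, the original slide shows policy iteration only as a figure; a minimal sketch of the two alternating steps (policy evaluation, then greedy policy improvement), using the same hypothetical MDP interface as the value-iteration sketch above:

    def policy_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        pi = {s: next(iter(actions(s)), None) for s in states}   # arbitrary initial policy
        V = {s: 0.0 for s in states}
        while True:
            # 1. Policy evaluation: compute V^pi by iterative Bellman backups
            while True:
                delta = 0.0
                for s in states:
                    a = pi[s]
                    v = (sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in P(s, a))
                         if a is not None else 0.0)
                    delta, V[s] = max(delta, abs(v - V[s])), v
                if delta < tol:
                    break
            # 2. Policy improvement: act greedily with respect to V^pi
            stable = True
            for s in states:
                best = max(actions(s), key=lambda a: sum(
                    p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in P(s, a)), default=None)
                if best != pi[s]:
                    pi[s], stable = best, False
            if stable:
                return pi, V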

Recap: Reinforcement Learning

• Passive:
  • Direct utility estimation
  • Adaptive Dynamic Programming
  • Temporal Difference Learning (model-free!)
• Active:
  • Model-based: exploitation vs. exploration
  • Q-learning (model-free!)

Image by Praphul Singh, https://blogs.oracle.com/datascience/reinforcement-learning-deep-q-networks

2. Deep Q-learning

A simple (tabular) Q-learning algorithm:
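The algorithm itself appears as an image in the original deck; a minimal sketch, assuming a hypothetical environment with env.reset() returning a state and env.step(a) returning (next_state, reward, done), might look like:

    import random
    from collections import defaultdict

    def tabular_q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                      # Q[(s, a)] defaults to 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy exploration
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda a: Q[(s, a)])
                s_next, r, done = env.step(a)
                # Q-learning (off-policy) TD update toward r + gamma * max_a' Q(s', a')
                target = r + gamma * max(Q[(s_next, a2)] for a2 in actions) * (not done)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q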

Q-learning on gridworld

Q-learning properties

Can tabular Q-learning scale?

Enter DQN (2015 Nature paper, Mnih et al.)

DQN overview

Neural Network to approximate Q-values
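The network itself is shown as a figure on the slide; a PyTorch sketch of a DQN-style Q-network, assuming 84x84 grayscale frames stacked 4 deep as in the Nature paper:

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per action."""
        def __init__(self, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # the 512-dim hidden state visualized below
                nn.Linear(512, n_actions),               # one Q-value per action
            )

        def forward(self, x):   # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
            return self.net(x)

Greedy action selection is then q_net(state).argmax(dim=1); in standard DQN the network is trained to regress toward r + γ max_a’ Q_target(s’, a’) using a replay buffer and a separate target network.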

Learning Space Invaders

Value function, visualized

• 512-dim state space (final hidden layer)
• “t-SNE” embedding visualizes this in 2D
• Color is value of state

Learning all the games!

• Human = 100%
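For context (this is the DQN paper’s human-normalized score, where 100% means the agent matches the professional human tester):

    normalized score = 100 × (agent score − random score) / (human score − random score)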

3. Policy Gradient Method

• Optimize over stochastic policies

Why Policy Optimization

Policy leads to trajectories 𝜏
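The objective (shown as an equation image in the original slides) is, in its standard form, the expected return over trajectories induced by the stochastic policy 𝜋_θ, and its gradient can be estimated from sampled rollouts:

    maximize_θ  J(θ) = E_{𝜏 ~ 𝜋_θ} [ Σ_t R(s_t, a_t, s_{t+1}) ]
    ∇_θ J(θ) ≈ (1/N) Σ_i Σ_t ∇_θ log 𝜋_θ(a_t | s_t) · A_t

where A_t is an advantage estimate, e.g. the return minus a baseline b.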

Vanilla Policy Gradient

• 1992 (!) REINFORCE algorithm
• Roll out many trajectories
• Compare the current policy with a baseline b
• Adapt the network to improve reward on trajectories that compare advantageously (At) to the baseline (see the sketch below)
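A minimal REINFORCE-with-baseline sketch in PyTorch, assuming a hypothetical policy_net mapping states to action logits and a Gym-like env simplified here to return (next_state, reward, done):

    import torch

    def reinforce_update(policy_net, optimizer, env, gamma=0.99, n_episodes=10):
        """One vanilla policy-gradient update from a batch of rolled-out trajectories."""
        log_probs, returns = [], []
        for _ in range(n_episodes):
            s, done, rewards, ep_log_probs = env.reset(), False, [], []
            while not done:
                logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
                dist = torch.distributions.Categorical(logits=logits)
                a = dist.sample()
                ep_log_probs.append(dist.log_prob(a))
                s, r, done = env.step(a.item())
                rewards.append(r)
            # discounted return-to-go for every timestep of this trajectory
            G, ep_returns = 0.0, []
            for r in reversed(rewards):
                G = r + gamma * G
                ep_returns.append(G)
            ep_returns.reverse()
            log_probs += ep_log_probs
            returns += ep_returns
        returns = torch.as_tensor(returns, dtype=torch.float32)
        baseline = returns.mean()                       # simple constant baseline b
        advantages = returns - baseline                 # A_t: how much better than the baseline
        loss = -(torch.stack(log_probs) * advantages).sum() / n_episodes
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

This follows the "roll out, compare to a baseline, adapt the network" recipe above; practical implementations add advantage normalization and learned state-dependent baselines.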

Improvements led to SOTA methods

• Rather complicated and beyond our scope
• DDPG: Deep Deterministic Policy Gradient
• TRPO: Trust Region Policy Optimization
• PPO: Proximal Policy Optimization

• See blog post: https://openai.com/blog/openai-baselines-ppo/

• Some successes:
  • OpenAI Gym:
  • Atari:

Summary

1. Recap: MDP and RL Methods
2. Deep Q-Learning (DQN) beats humans on many Atari games, and is a 10K-citation Nature paper now
3. Policy Optimization learns a network that takes a state and outputs a stochastic action. Tricky to get to work, lots of heavy math, but great success in robotics-like domains.

Recommended