
ECE-517: Reinforcement Learning in Artificial Intelligence

Lecture 1: Course Logistics, Introduction

Dr. Itamar Arel

College of Engineering
Electrical Engineering and Computer Science Department

The University of Tennessee
Fall 2012

August 23, 2012


But first, a quick anonymous survey …


Outline

Course logistics and requirements

Course roadmap & outline

Introduction


Course Objectives

Introduce the concepts & principles governing reinforcement-based machine learning systems

Review fundamental theory
Markov Decision Processes (MDPs)
Dynamic Programming (DP)
Practical systems; the role of Neural Networks in NDP
RL learning schemes (Q-Learning, TD-Learning, etc.)
Limitations of existing techniques and how they can be improved

Discuss software and hardware implementation considerations

Long-term goal: to contribute to your understanding of the formalism, trends and challenges in constructing RL-based agents


Course Prerequisites

A course in probability theory, or equivalent background, is required

Matlab/C/C++ competency
A Matlab tutorial has been posted on the course website (under the schedule page)

Open-mindedness & imagination …


Course Assignments

2 small projects
Main goal – provide students with basic hands-on experience in ML behavioral simulation and result interpretation
Analysis complemented by simulation
MATLAB programming oriented
Reports should include all background, explanations and results

5 problem assignment sets
Will cover the majority of the topics discussed in class
Assignments should be handed in before the beginning of class

Final project
Each student/group is assigned a topic
Project report & in-class presentation


Sony AIBO Lab

Located at MK 630

6 Sony AIBO dog robots (3rd generation)

Local wireless network (for communicating with the dogs)

Code for lab project/s will be written in Matlab
Interface has been prepared

Time slots should be coordinated with Instructor & TA


Textbooks & Reference Material

Lecture notes will be posted weekly on the course website (web.eecs.utk.edu/~itamar/courses/ECE-517), as well as …
Updated schedule
Assignment sets, sample codes
Grades
General announcements

Reading assignments will be posted on the schedule page

Textbook: R. Sutton and A. Barto, “Reinforcement Learning: An Introduction,” 1998. (available online!)


Grading policy & office hours

2 Small Projects – 25% (12.5 points each)

5 Assignment sets – 25% (5 points each)

Midterm (in-class) – 20%

Final project – 30%

Instructor: Dr. Itamar Arel
Office Hours: T/Tr 2:00 – 3:00 PM (MK 608)
TA: Derek Rose ([email protected]) – office @ MK 606
Office hours: contact TA
My email: [email protected]

Students are strongly encouraged to visit the course website (web.eecs.utk.edu/~itamar/courses/ECE-517) for announcements, lecture notes, updates, etc.


UTK Academic Honesty Statement

An essential feature of the University of Tennessee, Knoxville, is a commitment to maintaining an atmosphere of intellectual integrity and academic honesty. As a student of the university, I pledge that I will neither knowingly give nor receive any inappropriate assistance in academic work, thus affirming my own personal commitment to honor and integrity.

Bottom line: DO YOUR OWN WORK!


Understanding and constructing intelligence

What is intelligence? How do we define/evaluate it?

How can we design adaptive systems that improve their performance over time?

What are the limitations of RL-based algorithms?

How can artificial Neural Networks help us scale intelligent systems?

In what ways can knowledge be efficiently represented?

This course is NOT about …
Robotics
Machine learning (in the general sense)
Legacy AI – symbolic reasoning, logic, etc.
Image/vision/signal processing
Control systems theory

Why the course name “RL in AI”?


Course Outline & Roadmap

Introduction

Review of basic probability theory
Discrete-time/space probability theory
Discrete Markov Chains

Dynamic Programming
Markov Decision Processes (MDPs)
Partially Observable Markov Decision Processes (POMDPs)

Approximate Dynamic Programming (a.k.a. Reinforcement Learning)

Temporal Difference (TD) Learning, Planning

Midterm – Tuesday, Oct 9, 2012

Neuro-Dynamic Programming
Feedforward & Recurrent Neural Networks
Neuro-dynamic RL architecture

Applications and case studies

Final project presentations – Nov 27 – Dec 4, 2012

A detailed schedule is posted on the course website


Outline

Course logistics and requirements

Course outline & roadmap

Introduction


What is Machine Learning?

Discipline focusing on computer algorithms that learn to perform “intelligent” tasks

Learning is based on observation of data

Generally: learning to do better in the future based on what has been observed/experienced in the past

ML is a core subarea of AI, which also intersects with physics, statistics, theoretical CS, etc.

Examples of “ML” Problems
Optical character recognition
Face detection
Spoken language understanding
Customer segmentation
Weather prediction, etc.


Introduction

Why do we need good ML technology?
Human beings are lazy creatures …
Service robotics
$10B market in 2015 – Japan only!
Pattern recognition (speech, vision)
Data mining
Military applications
… many more

Many ML problems can be formulated as RL problems …


Introduction (cont.)

Learning by interacting with our environment is probably the first idea that occurs to us when we think about the nature of learning

Humans have no direct teachers

We do have a direct sensorimotor connection to the environment

We learn as we go along
Interaction with the environment teaches us what “works” and what doesn’t
We construct a “model” of our environment

This course explores a computational approach to learning from interaction with the environment


What is Reinforcement Learning?

Reinforcement learning is learning what to do – how to map situations to actions – in order to maximize a long-term objective function driven by rewards

It is a form of unsupervised learning

Two key components at the core of RL:
Trial-and-error – adapting the internal representation, based on experience, to improve future performance
Delayed reward – actions are produced so as to yield long-term (not just short-term) rewards

The “agent” must be able to:
Sense its environment
Produce actions that can affect the environment
Have a goal (a “momentary” cost metric) relating to its state


What is Reinforcement Learning? (cont.)

RL attempts to solve the Credit Assignment problem:
What is the long-term impact of an action taken now?

Unique to RL systems – a major challenge in ML
Necessitates an accurate model of the environment being controlled/interacted with
Something animals and humans do very well, and computers do very poorly

We’ll spend most of the semester formalizing solutions to this problem
Philosophically
Computationally
Practically (implementation considerations)


The Big Picture

Artificial Intelligence ⊃ Machine Learning ⊃ Reinforcement Learning

Types of Machine Learning

Supervised Learning: learn from labeled examples
Unsupervised Learning: process unlabeled examples
Example: clustering data into groups
Reinforcement Learning: learn from interaction
Defined by the problem
Many approaches are possible (including evolutionary)
Here we will focus on a particular family of approaches
Autonomous learning


Software vs. Hardware

Historically, ML has been on CS turf
Confinement to the Von Neumann architecture

Software limits scalability …
The human brain has 10^11 processors operating at once
However, each runs at ~150 Hz
It’s the massive parallelism that gives it power
Even 256 processors is not “massive parallelism”

“Computer Engineering” perspective
FPGA devices (reconfigurable computing)
GPUs
ASIC prospect
UTK/MIL group focus


Exploration vs. Exploitation

A fundamental trade-off in RL:
Exploitation of what worked in the past (to yield high reward)
Exploration of new, alternative action paths, so as to learn how to make better action selections in the future

The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task

On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward

We will review mathematical methods proposed to address this basic issue
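To make the trade-off concrete, here is a minimal sketch (not from the slides) of ε-greedy action selection, one of the simplest schemes of this kind, which the course's mathematical treatment will formalize later. The value-estimate list and ε = 0.1 are illustrative assumptions.

```python
import random

def epsilon_greedy(value_estimates, epsilon=0.1):
    """With probability epsilon, explore (random action);
    otherwise exploit (the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))   # exploratory move
    return max(range(len(value_estimates)),
               key=lambda a: value_estimates[a])        # greedy move

# Example: estimates favor action 2, yet it is occasionally bypassed
print(epsilon_greedy([0.1, 0.5, 0.7]))
```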


The Reinforcement Learning Framework

[Figure: the Agent (a.k.a. Controller) issues Actions to the Environment (a.k.a. Plant) and receives Observations and a Reward in return]
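The diagram reduces to a simple interaction loop. Below is a self-contained sketch of that loop; the CoinFlipEnv and RandomAgent classes are toy placeholders invented for illustration, not part of the course materials or any library.

```python
import random

class CoinFlipEnv:
    """Toy 'plant': reward 1 for guessing a coin flip, for 10 steps."""
    def reset(self):
        self.t = 0
        return 0                                   # dummy observation
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == random.randint(0, 1) else 0.0
        return 0, reward, self.t >= 10             # observation, reward, done

class RandomAgent:
    """Toy 'controller': acts randomly; a learning agent would adapt."""
    def act(self, observation):
        return random.choice([0, 1])
    def learn(self, obs, action, reward, next_obs):
        pass                                       # a real agent updates its policy/values here

def run_episode(env, agent):
    obs = env.reset()
    done = False
    while not done:
        action = agent.act(obs)                    # Agent -> Environment: action
        next_obs, reward, done = env.step(action)  # Environment -> Agent: observation + reward
        agent.learn(obs, action, reward, next_obs)
        obs = next_obs

run_episode(CoinFlipEnv(), RandomAgent())
```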


Some Examples of RL

A master chess player
Planning – anticipating replies and counter-replies
Immediate, intuitive judgment

A mobile robot decides whether to enter a room or try to find its way back to a battery-charging station

Playing backgammon
Obviously, a strategy is necessary
Some luck is involved (stochastic game)

In all cases, the agent tries to achieve a goal despite uncertainty about its environment
The effect of an action cannot be fully predicted
Experience allows the agent to improve its performance over time


Origins of Reinforcement Learning

Artificial Intelligence

Control Theory (MDP)

Operations Research

Cognitive Science and Psychology

More recently, Neuroscience

RL has solid foundations and is a well-established research field


Elements of Reinforcement Learning

Beyond the agent and the environment, we have the following four main elements in RL:

1) Policy – defines the learning agent’s way of behaving at any given time. Roughly speaking, a policy is a mapping from (perceived) states of the environment to actions to be taken when in those states.
Usually stochastic (adapts as you go along)
Enough to determine the agent’s behavior

2) Reward function – defines the goal in an RL problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state.
The agent’s goal is to maximize the reward over time
May be stochastic
Drives the policy employed and its adaptation


Elements of Reinforcement Learning (cont.)

3) Value function – whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run.
Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
It allows the agent to look over the “horizon”
Actions are derived from value estimates, not rewards
We measure rewards, but we estimate and act upon values – this corresponds to strategic/long-term thinking
Intuitively a prerequisite for intelligence/intelligent control (plants vs. animals)
Obtaining a good value function is a key challenge in designing good RL systems
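One common way (formalized later in the course) to make "total amount of reward accumulated over the future" precise is the discounted return; the value of a state is then the expectation of this quantity over future experience. The discount factor γ = 0.9 below is an illustrative assumption.

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ...  Computed right-to-left.
    The value of a state is the expected value of G from that state onward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward of 1 arriving two steps from now is worth 0.9^2 = 0.81 today
print(discounted_return([0.0, 0.0, 1.0]))
```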


Elements of Reinforcement Learning (cont.)

4) Model – an observable entity that mimics the behavior of the environment.
For example, given a state and action, the model might predict the resultant next state and next reward
As we will later discuss, predictability and auto-associative memory are key attributes of the mammalian brain
Models are used for planning – any way of deciding on a course of action by considering possible future scenarios before they actually occur

Note that RL can work (sometimes very well) with an incomplete model

We’ll go over a range of model platforms to achieve the above

As a side note: RL is essentially an optimization problem. However, it is one of the many optimization problems that are extremely hard to solve (optimally).
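As a rough illustration of how a model supports planning: given something that predicts the next state and reward for a state-action pair, the agent can simulate each candidate action before committing to one in the real world. Everything below (function names, the toy two-state model, γ = 0.9) is an invented assumption for the sketch, not course material.

```python
def plan_one_step(model, values, state, actions, gamma=0.9):
    """Score each action by simulating it with the model and consulting the
    current value estimates, then act greedily on the simulated outcomes."""
    def score(action):
        next_state, reward = model(state, action)  # model: (state, action) -> (next state, reward)
        return reward + gamma * values.get(next_state, 0.0)
    return max(actions, key=score)

# Example with a hand-made deterministic model of a two-state world
model = lambda s, a: ((s + a) % 2, 1.0 if (s + a) % 2 == 1 else 0.0)
print(plan_one_step(model, {1: 5.0}, state=0, actions=[0, 1]))  # -> 1
```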


An Extended Example: Tic-Tac-Toe

Consider a classical tic-tac-toe game, whereby the winner places three marks in a row, horizontally, vertically or diagonally

Let’s assume:
We are playing against an imperfect player
Draws and losses are equally bad for us

Q: Can we design a player that will find imperfections in the opponent’s play and learn to maximize its chances of winning?

Classical machine learning schemes would never visit a state that has the potential to lead to a loss

We want to exploit the weaknesses of the opponent, so we may decide to visit a state that has the potential of leading to a loss


An Extended Example: Tic-Tac-Toe (cont.)

Using dynamic programming (DP), we can compute an optimal solution for any opponent

However, we would need a specification of the opponent (e.g. state-action probabilities)

Such information is usually unavailable to us

In RL, we estimate this information from experience

We later apply DP, or other sequential decision-making schemes, based on the model we obtained through experience

A policy tells the agent how to make its next move based on the state of the board

Winning probabilities can be derived once the opponent is known


An Extended Example: Tic-Tac-Toe (cont.)

How do we solve this in RL?
Set up a table of numbers – one for each state of the game
This number will reflect the probability of winning from that particular state
This is treated as the state’s value, and the entire learned table denotes the value function
If V(a) > V(b), state a is preferred over state b
All states with three X’s in a row have a win probability of 1
All states with three O’s in a row have a win probability of 0
All other states are preset to a probability of 0.5
When playing, we make the move that we predict would result in the state with the highest value (exploitation)
Occasionally, we choose randomly among the non-zero-valued states (exploratory moves) – see the sketch below
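A minimal sketch of the value table and move-selection rule just described, under stated assumptions: board states are hashable keys (e.g. tuples of cell marks), unseen states default to 0.5 as on the slide, and the 10% exploration rate is an arbitrary stand-in for "occasionally".

```python
import random

values = {}   # maps board state -> estimated win probability (the value function)

def value_of(state):
    """Win states would be preset to 1, loss states to 0; everything else starts at 0.5."""
    return values.get(state, 0.5)

def select_move(candidate_states, explore_prob=0.1):
    """Usually pick the reachable state with the highest value (exploitation);
    occasionally pick at random (exploratory move)."""
    if random.random() < explore_prob:
        return random.choice(candidate_states)
    return max(candidate_states, key=value_of)
```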


An Extended Example: Tic-Tac-Toe (cont.)

While playing, we update the values of the states

The current value of the earlier state is adjusted to be closer to the value of the later state:

V(s) ← V(s) + α [V(s′) − V(s)]

where 0 < α < 1 is a learning parameter (step-size parameter), s is the state before the move, and s′ is the state after the move

This update rule is an example of the Temporal-Difference Learning method

This method performs quite well – it converges to the optimal policy (for a fixed opponent)

Can be adjusted to allow for slowly-changing opponents
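The update rule transcribes almost directly to code. A minimal sketch, reusing the dict-backed value table from the earlier sketch; α = 0.1 is an illustrative step-size choice.

```python
def td_update(values, s, s_next, alpha=0.1):
    """V(s) <- V(s) + alpha * [V(s') - V(s)]: move the earlier state's value
    toward the value of the state that followed it."""
    v, v_next = values.get(s, 0.5), values.get(s_next, 0.5)
    values[s] = v + alpha * (v_next - v)

# Example: a state that led to a won position (value 1.0) gets pulled upward
table = {"earlier": 0.5, "later": 1.0}
td_update(table, "earlier", "later")
print(table["earlier"])  # 0.55
```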


An Extended Example: Tic-Tac-Toe (cont.)

RL features encountered …
Emphasis on learning from interaction – in this case, with the opponent
A clear goal – correct planning takes into account delayed rewards
For example, setting up traps for a shortsighted opponent
No model of the opponent exists a priori

Although this example is a good one, RL methods can also …
be applied to infinite-horizon problems (with no terminal state)
be applied to cases where there is no external adversary (e.g. a “game against nature”)

Backgammon example: 10^20 states, using Neural Nets


For next class …

Read chapter 1 in SB (up to and including section 1.5)

Read chapter 2 in SB