Exploration and Other Applications of Reinforcement Learning in Robotics

AS-84.4340 Postgraduate Seminar in Automation Technology

Juhana Ahtiainen

Outline
Introduction
Exploration
  Information gain
  Monte Carlo algorithm
  Active localization
  Mapping in occupancy grids
  Multi-robot extension
  Exploration for SLAM
Other Applications of Reinforcement Learning
  Recent advances
  RoboCup
  Humanoid robots
Summary
Exercise

Introduction
Reinforcement learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward.
Exploration is the problem of controlling a robot so as to maximize its knowledge about the external world.

Introduction
The environment is usually modelled as a finite-state Markov decision process.
Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states.
Unlike in supervised learning, correct input-output pairs are never presented to the learner.

Exploration
The exploration problem is paramount in robotics:
abandoned mines, nuclear disasters, Mars...
The exploration problem comes in many forms:
Acquiring a map in a static environment (known pose)
Moving factors (dynamic environment), e.g. the pursuit-evasion problem
Active localization (known map)
SLAM
Exploration arises virtually anywhere in robotics.

POMDP and exploration
Is exploration fully subsumed by the POMDP framework? A POMDP can be turned into an algorithm whose sole goal is to maximize information: the payoff function is, e.g., the information gain.
Exploring with a full POMDP is often not a good idea:
the number of unknown variables is huge, and so is the number of possible observations.

Ch. 17 of Probabilistic Robotics presents practical algorithms for the high-dimensional exploration problem.
All of them are greedy (the look-ahead is limited to only one exploration action).
An exploration action can involve a sequence of control actions,
e.g. selecting a location anywhere in the map; moving there is considered a single exploration action.

Information gain
The key to exploration is information.
The entropy H_p(x) of a probability distribution p is the expected information E[-log p].
Entropy is at its maximum when p is a uniform distribution and at its minimum when p is a point-mass distribution.
In exploration we seek to minimize the expected entropy of the belief after executing an action.
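Written out, using the standard definition that matches the notation above:

  H_p(x) = E[-\log p(x)] = -\int p(x) \log p(x)\, dx

with the integral replaced by a sum for discrete x.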

Information gain
Conditional entropy of the state x' after executing action u and measuring z:

Information gain associated with action u in belief b is given by the difference:

Conditional entropy with the measurement integrated out:
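In the notation of Probabilistic Robotics Ch. 17, these quantities can be written roughly as follows (the slides' exact symbols may differ):

  H_b(x' | z, u) = -\int p(x' | z, u, b) \log p(x' | z, u, b)\, dx'

  I_b(u | z) = H_p(x) - H_b(x' | z, u)

  E_z[H_b(x' | z, u)] = \int\!\!\int\!\!\int H_b(x' | z, u)\, p(z | x')\, p(x' | u, x)\, b(x)\, dz\, dx'\, dx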

Greedy techniques
Expected information lets us formulate the exploration problem as a decision-theoretic problem of the kind addressed in the previous presentations.
Optimal exploration maximizes the difference between the information gain and the costs.

Greedy techniques
To compute the utility of u, compute the expected entropy after executing u and making an observation.
The previous equation then resolves to:
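A sketch of the resulting greedy exploration policy, following the standard Ch. 17 formulation (alpha trades off information against cost; exact notation may differ from the slides):

  \pi(b) = \arg\max_u \; \alpha \left( H_p(x) - E_z[H_b(x' | z, u)] \right) - \int r(x, u)\, b(x)\, dx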

Exploration techniques
Most of them are greedy:
optimal only at time horizon 1, because of the enormous branching factor in exploration.
The goal is to acquire new information; each new belief state requires adjusting the policy.
Exploration policies therefore have to be highly reactive.

Monte Carlo exploration
Sample a state x from the momentary belief b.
Sample also the next state x' and the corresponding measurement z.
Compute the new posterior belief.
Evaluate the entropy-cost trade-off.
Choose the action with the highest Monte Carlo information gain-cost value.
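A minimal Python sketch of this sampling loop. The `model` interface (sample_state, sample_next_state, sample_measurement, posterior, entropy, cost) is an illustrative assumption standing in for the robot's motion/measurement models and Bayes filter, not code from the book:

  def mc_exploration(belief, actions, model, num_samples=100, alpha=1.0):
      """Greedy Monte Carlo exploration step (hypothetical model interface)."""
      best_action, best_value = None, float("-inf")
      for u in actions:
          total = 0.0
          for _ in range(num_samples):
              x = model.sample_state(belief)              # x ~ b(x)
              x_next = model.sample_next_state(x, u)      # x' ~ p(x' | u, x)
              z = model.sample_measurement(x_next)        # z ~ p(z | x')
              b_next = model.posterior(belief, u, z)      # Bayes filter update
              gain = model.entropy(belief) - model.entropy(b_next)
              total += alpha * gain - model.cost(x, u)    # entropy-cost trade-off
          value = total / num_samples                     # Monte Carlo estimate
          if value > best_value:
              best_action, best_value = u, value
      return best_action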

Monte Carlo exploration
May still be very time consuming: the number of possible measurements can be huge.
E.g. a robot with 24 ultrasonic sensors that each report one byte of range data
has 256^24 possible sonar scans at a specific location.

Active localization
The simplest case of exploration arises when estimating the state of a low-dimensional variable.
Here we seek information about the robot's pose but already have a map of the environment.
Moving to the right place can make localization very fast.

Active localization
Can be solved greedily, but we need to define exploration actions differently,
e.g. as target locations in the robot's coordinate frame. This is fine if we can devise a low-level module that maps such an action back into low-level controls.
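A rough sketch of the greedy choice over such target-location actions; entropy, expected_entropy_at and path_cost are hypothetical helpers (simulating a move-and-sense action on the known map), introduced only for illustration:

  def choose_target(belief, candidate_targets, grid_map,
                    entropy, expected_entropy_at, path_cost, alpha=1.0):
      """Pick the target location with the best information gain vs. cost."""
      def utility(target):
          # expected_entropy_at simulates moving to `target` and measuring there
          gain = entropy(belief) - expected_entropy_at(belief, target, grid_map)
          return alpha * gain - path_cost(belief, target, grid_map)
      return max(candidate_targets, key=utility)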

Example (sequence of figure slides illustrating active localization)

Analysis of active localization
Greedy: cannot compose multiple exploration actions.
Action definition: open-loop control while moving, so no measurements are taken along the way.
A real robot may abandon a target point (e.g. a closed door), which is not considered during planning.
Nevertheless, it performs well in practice.

Learning occupancy grid maps
Mapping problems include many more unknown variables.
We treat the information gain as independent between different grid cells.
How to compute the gain:
Entropy
Expected information gain
Binary gain (frontier-based exploration)

Calculating the information gain: entropy
Straightforward; in the resulting gain map, the brighter the cell, the larger the gain.

Calculating the information gain: expected information gain
Entropy only measures the current information; the expected information gain requires assumptions on the nature of the information to be gathered.

Calculating the information gain: binary gain
The simplest of all and by far the most popular.
A very crude approximation of the expected information, but it tends to work well in practice.
It is the core of frontier-based exploration.
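A small NumPy sketch of the binary gain and the frontier cells it induces, assuming the common convention that occupancy values near 0.5 mark unexplored cells; the threshold and 4-neighbourhood are illustrative choices:

  import numpy as np

  def binary_gain(grid, tol=1e-3):
      # Binary gain: a cell is worth exploring only if it is still unexplored.
      return (np.abs(grid - 0.5) < tol).astype(float)

  def frontier_cells(grid, tol=1e-3):
      unknown = np.abs(grid - 0.5) < tol
      free = grid < 0.5 - tol
      # Frontier cells: free cells with at least one unexplored 4-neighbour
      # (border wrap-around from np.roll is ignored for brevity).
      frontier = np.zeros(grid.shape, dtype=bool)
      for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
          frontier |= free & np.roll(unknown, (dr, dc), axis=(0, 1))
      return frontier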

Propagating gain
Definition of an exploration action:
simple but effective: move to an x-y location along the minimum-cost path, and then sense all grid cells within a small circular radius around the robot.
Value iteration over the grid yields the best greedy exploration action (a sketch follows below).
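A rough NumPy sketch of propagating the gain with value iteration; the gain map, obstacle mask and step cost are assumed inputs, and border wrap-around from np.roll is ignored for brevity:

  import numpy as np

  def propagate_gain(gain, occupied, step_cost=0.01, iters=500):
      """Propagate per-cell gain (float array) backwards through free space."""
      value = gain.copy()
      value[occupied] = -np.inf
      for _ in range(iters):
          best_neighbor = np.full(value.shape, -np.inf)
          for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
              best_neighbor = np.maximum(best_neighbor,
                                         np.roll(value, (dr, dc), axis=(0, 1)))
          new_value = np.maximum(gain, best_neighbor - step_cost)
          new_value[occupied] = -np.inf
          if np.allclose(new_value[~occupied], value[~occupied]):
              break
          value = new_value
      # Greedy exploration: move toward the neighbouring cell with the highest value.
      return value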

Learning OG maps - summary
A crude approximation that totally ignores the information acquired as the robot moves,
yet it tends to work well in practice.

Extension to multi-robot systems
Acquire a map through cooperative exploration.
The speed-up is usually linear in the number of robots K, and may even reach 2K,
since a single robot might have to traverse many areas twice.

Coordination
Static greedy task allocation techniques (see the sketch below):
Compute a value function for each robot (minimum at the robot's pose).
Pick the optimal cell to explore for each robot.
Reset the gain map to zero in the vicinity of the chosen cell, so that the remaining robots pick different cells.
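A compact Python sketch of this allocation loop. cost_to_go (a value function from each robot's pose, e.g. from value iteration), the gain map and the clearing radius are illustrative stand-ins; the circular clearing is approximated here with a square window:

  import numpy as np

  def allocate_goals(gain, robot_poses, cost_to_go, clear_radius=5, alpha=1.0):
      goals = []
      gain = gain.copy()
      for pose in robot_poses:
          cost = cost_to_go(pose)            # minimum (zero) at the robot's pose
          utility = alpha * gain - cost      # trade off gain against travel cost
          goal = np.unravel_index(np.argmax(utility), utility.shape)
          goals.append(goal)
          # Reset the gain map to zero around the chosen cell so the next
          # robot is pushed toward a different region.
          r, c = goal
          gain[max(0, r - clear_radius): r + clear_radius + 1,
               max(0, c - clear_radius): c + clear_radius + 1] = 0.0
      return goals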

Summary of multi-robot exploration
Simple: each robot greedily picks the best available goal and prohibits the others from picking the same cell.
Easily trapped in a local minimum, e.g. robots crossing paths.
Improved coordination techniques enable robots to trade goals.

Example

SLAM in exploration
In SLAM we know neither the map nor the pose.
Without knowledge of the pose, the integration of sensor information can lead to serious errors;
yet a robot that focuses only on its pose does not move.

Entropy decomposition!

Entropy decomposition
The full SLAM posterior factorizes into a posterior over paths and a map posterior conditioned on the path.
This implies that the joint entropy decomposes accordingly, where the expectation is taken over the posterior of the robot path:
the SLAM entropy is the sum of the path entropy and the expected entropy of the map.

Derivation of decomposition
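A sketch of that derivation in standard notation (the slides' symbols may differ slightly); conditioning on z_{1:t} and u_{1:t} is left implicit:

  p(x_{1:t}, m) = p(x_{1:t})\, p(m | x_{1:t})

  H[p(x_{1:t}, m)] = E[-\log p(x_{1:t}, m)]
                   = E[-\log p(x_{1:t})] + E[-\log p(m | x_{1:t})]
                   = H[p(x_{1:t})] + E_{x_{1:t}}\!\left[ H[p(m | x_{1:t})] \right]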

Exploration in FastSLAM
Based on grid-based FastSLAM (Ch. 13.10):
the posterior is represented by a set of particles, each containing a robot path and an occupancy grid map.
The FastSLAM exploration algorithm is a test-and-evaluate algorithm:
it proposes a course of action for exploration, evaluates these actions by measuring the residual entropy, and selects the action that minimizes the resulting entropy.
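A bare-bones sketch of that test-and-evaluate loop; simulate, path_entropy and expected_map_entropy are hypothetical helpers that roll the particle set forward under a candidate action and evaluate the two entropy terms of the decomposition above:

  def fastslam_exploration(particles, candidate_actions,
                           simulate, path_entropy, expected_map_entropy):
      """Return the candidate action with the lowest residual SLAM entropy."""
      best_action, best_entropy = None, float("inf")
      for action in candidate_actions:
          simulated = simulate(particles, action)  # roll particles forward under the action
          residual = path_entropy(simulated) + expected_map_entropy(simulated)
          if residual < best_entropy:
              best_action, best_entropy = action, residual
      return best_action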

FastSLAM summary
The FastSLAM exploration algorithm is an extension of the Monte Carlo exploration algorithm with two insights:
it applies to a full sequence of controls, and
it uses two types of entropies: one pertaining to the robot's path and one to the map.

Example of SLAM (sequence of figure slides)

Reinforcement Learning in Robotics
Reinforcement learning offers one of the most general frameworks for taking traditional robotics towards true autonomy and versatility.
It has succeeded in many well-defined, low-dimensional, discrete problems:
Backgammon (Tesauro 1994)
Elevator control (Crites & Barto 1996)
Helicopter control (Bagnell & Schneider 2001)

Reinforcement Learning in Robotics
Google Scholar: about 19,700 results for "reinforcement learning in robotics", of which about 4,340 are recent articles (since 2003).

Recent advances (2003)
The curse of dimensionality motivates hierarchical reinforcement learning:
temporal abstractions, so that decisions are not required at every step.

Semi-Markov Decision Processes
A generalization of the MDP in which the time between one decision and the next is a random variable, real- or integer-valued. An SMDP:
allows the decision maker to choose actions whenever the system state changes,
models the system evolution in continuous time,
allows the time spent in a particular state to follow an arbitrary probability distribution.
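For concreteness, one standard SMDP Q-learning update (a textbook formulation, not taken from the slides) discounts over the random duration tau of the chosen action:

  Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_\tau + \gamma^{\tau} \max_{a'} Q(s', a') - Q(s, a) \right],
  \qquad R_\tau = \sum_{k=0}^{\tau - 1} \gamma^{k} r_{t+k+1}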

Reinforcement Learning in RoboCup (http://www.robocup.org/)

Keepaway = keepers vs. takers, at most 4 vs. 3.
A large state space, modelled as an SMDP.
In RoboCup, RL has also been used to:
kick the ball into the goal while avoiding an opponent,
let a full team (11 players) learn collaborative passing and shooting (MC),
learn low-level skills (dribbling, passing, kicking).

RL for humanoid robots
Applying RL to high-dimensional movement systems like humanoid robots remains an unsolved problem; greedy algorithms are likely to fail.
Natural Actor-Critic (Peters et al. 2005):
efficiently optimizes nonlinear motor primitives, based on a natural-gradient formulation.
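As a reminder of what the natural-gradient formulation means (standard definition, not specific to the slides): the plain policy gradient is premultiplied by the inverse Fisher information matrix of the policy,

  \tilde{\nabla}_\theta J(\theta) = G(\theta)^{-1} \nabla_\theta J(\theta),
  \qquad G(\theta) = E\!\left[ \nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^{\top} \right]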

Summary
Reinforcement learning has been applied successfully in many areas of robotics; high dimensions are still a problem.
Exploration is one application of RL: maximize the knowledge gained by the robot.
Active localization – seeking the pose, map known.
Mapping – pose known at all times.
SLAM – decomposition of entropy, map and pose unknown.

References
1. S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, Cambridge, MA, 2005.
2. Wikipedia, http://en.wikipedia.org/wiki/Reinforcement_learning (Sutton, Richard S., and Barto, Andrew G. (1998). Reinforcement Learning: An Introduction. MIT Press.)
3. Barto, A. G. and Mahadevan, S. (2003). Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems, 13(4), 341–379.
4. Peter Stone, Richard S. Sutton, and Gregory Kuhlmann. Reinforcement Learning for RoboCup-Soccer Keepaway. Adaptive Behavior, 13(3):165–188, 2005.
5. Peters, J., Vijayakumar, S., Schaal, S. (2003). Reinforcement learning for humanoid robotics. In: Humanoids2003, Third IEEE-RAS International Conference on Humanoid Robots, Karlsruhe, Germany, Sept. 29-30.
6. Maja J. Matarić. Reinforcement Learning in the Multi-Robot Domain. Autonomous Robots, 4(1), Mar 1997, 73–83.
7. William D. Smart and Leslie Pack Kaelbling. Effective Reinforcement Learning for Mobile Robots. International Conference on Robotics and Automation, May 11-15, 2002.
8. Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2), 215–219.
9. Crites, R. H., & Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8, pp. 1017–1023). Cambridge, MA: The MIT Press.
10. Bagnell, J. A., & Schneider, J. (2001). Autonomous helicopter control using reinforcement learning policy search methods. In International Conference on Robotics and Automation (pp. 1615–1620). IEEE.
11. Jan Peters, Sethu Vijayakumar, Stefan Schaal (2005). Natural Actor-Critic. In Proceedings of the 16th European Conference on Machine Learning (ECML 2005).

Exercise:

Recommended