Exploration and Other Applications of Reinforcement Learning in Robotics
AS-84.4340 Postgraduate Seminar in Automation Technology
Juhana Ahtiainen
Outline
- Introduction
- Exploration
  - Information gain
  - Monte Carlo algorithm
  - Active localization
  - Mapping in occupancy grids
  - Multi-robot extension
  - Exploration for SLAM
- Other applications of reinforcement learning
  - Recent advances
  - RoboCup
  - Humanoid robots
- Summary
- Exercise
Introduction
- Reinforcement learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward.
- Exploration is the problem of controlling a robot so as to maximize its knowledge about the external world.
- The environment is usually modelled as a finite-state Markov Decision Process (MDP).
- Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states.
- Unlike supervised learning, the agent is never presented with correct input-output pairs.
Exploration
- The exploration problem is paramount in robotics: abandoned mines, nuclear disasters, Mars...
- It comes in many forms:
  - Acquiring a map of a static environment with known pose
  - Moving factors (dynamic environments), e.g. the pursuit-evasion problem
  - Active localization (known map)
  - SLAM
  - Virtually anywhere in robotics
POMDPs and exploration
- Exploration is fully subsumed by the POMDP framework: turn the POMDP into an algorithm whose sole goal is to maximize information (payoff function = e.g. information gain).
- Exploring using a POMDP is often not a good idea: the number of unknown variables is huge, and so is the number of possible observations.
- Ch. 17 presents practical algorithms for high-dimensional exploration problems.
- All are greedy (look-ahead is limited to only one exploration action), but an exploration action can involve a sequence of control actions: e.g. selecting a location anywhere in the map and moving there is considered a single exploration action.
Information gain
- The key to exploration is information.
- The entropy H_p(x) of a probability distribution p is the expected information E[-log p].
- Entropy is at its maximum when p is a uniform distribution and at its minimum when p is a point-mass distribution.
- In exploration we seek to minimize the expected entropy of the belief after executing an action.
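For reference, the entropy definition above can be written out (continuous case; notation following Probabilistic Robotics) as:

```latex
H_p(x) = E\big[-\log p(x)\big] = -\int p(x)\,\log p(x)\; dx
```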
Information gain
- Conditional entropy of the state x' after executing action u and measuring z:
- Information gain associated with action u in belief b is given by the difference:
- Conditional entropy with the measurement integrated out:
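The equations on the original slides are not reproduced here; in the notation of Probabilistic Robotics (Ch. 17), the three quantities above take the following standard forms (a reconstruction, not a verbatim copy of the slides):

```latex
H_b(x' \mid z, u) = -\int p(x' \mid z, u)\,\log p(x' \mid z, u)\; dx'

I_b(u \mid z) = H(b) - H_b(x' \mid z, u)

I_b(u) = H(b) - E_z\!\left[ H_b(x' \mid z, u) \right]
       = H(b) - \iint H_b(x' \mid z, u)\, p(z \mid x')\, p(x' \mid u, b)\; dx'\, dz
```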
Greedy techniques
- The expected information gain lets us formulate exploration as a decision-theoretic problem of the kind addressed in the previous presentations.
- Optimal exploration maximizes the difference between the information gain and the costs.
- Utility of u: compute the expected entropy after executing u and observing.
- The previous equation resolves to:
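A sketch of the resulting greedy exploration rule, where α is a trade-off factor between information and cost (reconstructed in the book's notation, not copied from the slide):

```latex
\pi(b) = \operatorname*{argmax}_{u}\; \alpha\, I_b(u) - E\big[\,\mathrm{cost}(x, u)\,\big]
```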
Exploration techniques
- Most of them are greedy: optimal at time horizon 1.
- Enormous branching factor in exploration.
- The goal is to acquire new information; each new belief state requires adjusting the policy.
- Exploration policies therefore have to be highly reactive.
Monte Carlo exploration
- Sample a state x from the momentary belief b.
- Sample the next state x' and a corresponding measurement z.
- Compute the new posterior belief.
- Evaluate the entropy-cost trade-off.
- Choose the action with the highest Monte Carlo information gain-cost value.
Monte Carlo exploration
- May still be very time consuming: the number of possible measurements can be huge.
- E.g. a robot with 24 ultrasonic sensors that each report one byte of range data has 256^24 possible sonar scans at a specific location.
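The sampling loop above can be sketched on a toy discrete localization problem. Everything below (the ring world, `motion`, `meas_prob`, the cost weight) is invented for illustration and is not from the slides:

```python
import math
import random

def entropy(b):
    """Shannon entropy of a discrete belief (in nats)."""
    return -sum(p * math.log(p) for p in b if p > 0)

def motion(x, u, n):
    """Deterministic motion on a ring of n cells (toy model)."""
    return (x + u) % n

def meas_prob(z, x):
    """Toy sensor: a landmark detector that fires (z=1) near cell 0."""
    near = (x == 0)
    return 0.9 if (z == 1) == near else 0.1

def posterior(b, u, z):
    """Bayes filter update of the belief after action u and measurement z."""
    n = len(b)
    pred = [0.0] * n
    for x, p in enumerate(b):
        pred[motion(x, u, n)] += p
    post = [meas_prob(z, x) * p for x, p in enumerate(pred)]
    s = sum(post)
    return [p / s for p in post]

def mc_expected_entropy(b, u, samples=500, rng=random):
    """Monte Carlo estimate of E_z[H(b')] after executing u:
    sample x ~ b, then x' ~ p(x'|x,u), then z ~ p(z|x'),
    and average the entropy of the resulting posterior."""
    n, total = len(b), 0.0
    for _ in range(samples):
        x = rng.choices(range(n), weights=b)[0]
        x2 = motion(x, u, n)
        z = 1 if rng.random() < meas_prob(1, x2) else 0
        total += entropy(posterior(b, u, z))
    return total / samples

random.seed(0)
b = [1.0 / 8] * 8   # uniform prior over 8 cells
# pick the action with the best (estimated) entropy-cost trade-off
best = min(range(1, 4), key=lambda u: mc_expected_entropy(b, u) + 0.01 * u)
print(best)
```

The estimate avoids enumerating all possible measurements, which is exactly what makes the approach attractive when, as above, the measurement space is astronomically large.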
Active localization
- The simplest case of exploration: estimating the state of a low-dimensional variable.
- Here we seek information about the robot's pose but already have a map of the environment.
- Moving to the right place can make localization very fast.
- Can be solved greedily, but we need to define exploration actions differently, e.g. as target locations in the robot's coordinate frame.
- This is fine if we can devise a low-level module to map such an action back into low-level controls.
Example (sequence of figure slides; images not reproduced)
Analysis of active localization
- Greedy: cannot compose multiple exploration actions.
- Action definition: open-loop control while moving, no measurements.
- A real robot can abandon a target point (e.g. a closed door); this is not considered during planning.
- Performs well in practice.
Learning occupancy grid maps
- Mapping problems involve many more unknown variables.
- We treat the information gain as independent between different grid cells.
- How to compute the gain?
  - Entropy
  - Expected information gain
  - Binary gain (frontier-based exploration)
Calculating the information gain
- Entropy: straightforward (in the gain map, the brighter a cell, the larger the gain).
- Expected information gain: entropy only measures the current information; this requires assumptions on the nature of the information.
- Binary gain: the simplest of all, and by far the most popular. A very crude approximation of the expected information, but it tends to work well in practice. It is the core of frontier-based exploration.
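A minimal sketch of the two per-cell gain measures on a toy occupancy grid (the grid values and the 0.5-means-unknown convention are illustrative assumptions, not from the slides):

```python
import math

def cell_entropy(p):
    """Entropy of a single binary occupancy cell with occupancy prob. p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def frontier_cells(grid):
    """Binary gain: free cells (p < 0.5) adjacent to unknown cells (p == 0.5)."""
    rows, cols = len(grid), len(grid[0])
    frontier = set()
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] >= 0.5:            # occupied or unknown: skip
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == 0.5:
                    frontier.add((r, c))
                    break
    return frontier

# Toy map: 0.1 = likely free, 0.9 = likely occupied, 0.5 = unexplored
grid = [
    [0.1, 0.1, 0.5],
    [0.1, 0.9, 0.5],
    [0.1, 0.5, 0.5],
]
print(sorted(frontier_cells(grid)))
```

Note how the unknown cells (p = 0.5) are exactly the cells of maximal entropy, which is why the binary frontier criterion approximates the entropy-based gain.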
Propagating gain
- Requires a definition of an exploration action.
- Simple but effective: move to an x-y location along the minimum-cost path, then sense all the grid cells in a small circular diameter around the robot.
- Value iteration yields the best greedy exploration action.
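The value-iteration step can be sketched as follows: each cell's value is the best reachable gain minus the travel cost to it, propagated backwards over the grid (the unit step cost and the toy gain map are illustrative assumptions):

```python
def propagate_gain(gain, cost=1.0, gamma=1.0, iters=None):
    """Deterministic value iteration over a 4-connected grid:
    V(c) = max(gain(c), max over neighbors n of gamma*V(n) - cost),
    so every cell ends up valued by its best reachable gain minus
    the travel cost to reach it."""
    rows, cols = len(gain), len(gain[0])
    V = [row[:] for row in gain]
    iters = iters if iters is not None else rows * cols
    for _ in range(iters):
        changed = False
        for r in range(rows):
            for c in range(cols):
                best = V[r][c]
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        best = max(best, gamma * V[rr][cc] - cost)
                if best > V[r][c] + 1e-12:
                    V[r][c] = best
                    changed = True
        if not changed:
            break
    return V

gain = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 10.0],   # high information gain in one corner
]
V = propagate_gain(gain)
# the greedy exploration action from any cell: step toward the
# neighbor with the highest value
```

After convergence, following the gradient of V from the robot's pose realizes the "move along the minimum-cost path to the best gain" action in a single greedy rule.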
Learning OG maps - summary
- A crude approximation that totally ignores the information acquired as the robot moves.
- Tends to work well in practice.
Extension to multi-robot systems
- Acquire a map through cooperative exploration.
- The speed-up is usually linear in the number of robots K, and can even reach 2K, since a single robot might have to traverse many areas twice.
Coordination
- Static greedy task allocation techniques.
- Compute a value function for each robot (minimum at the robot's pose).
- Select the optimal cell to explore for each robot.
- Reset the gain map to zero in the vicinity of the chosen cell.
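The coordination loop above can be sketched as sequential greedy assignment; the helper names, the BFS cost model, and the toy corridor map are all illustrative assumptions, not from the slides:

```python
from collections import deque

def bfs_costs(free, start):
    """Travel cost (grid steps) from start to every reachable free cell."""
    rows, cols = len(free), len(free[0])
    cost = {start: 0}
    q = deque([start])
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and free[rr][cc] \
                    and (rr, cc) not in cost:
                cost[(rr, cc)] = cost[(r, c)] + 1
                q.append((rr, cc))
    return cost

def assign_goals(free, gain, robots, radius=1):
    """Each robot in turn greedily picks the cell maximizing gain - cost;
    the gain near that cell is then zeroed so no later robot picks it."""
    gain = {cell: g for cell, g in gain.items()}   # local copy
    goals = []
    for pose in robots:
        cost = bfs_costs(free, pose)
        goal = max(gain, key=lambda cl: gain[cl] - cost.get(cl, float("inf")))
        goals.append(goal)
        for cl in list(gain):                      # reset gain in the vicinity
            if abs(cl[0] - goal[0]) + abs(cl[1] - goal[1]) <= radius:
                gain[cl] = 0.0
    return goals

free = [[True] * 4]                                # a 1x4 corridor
gain = {(0, 0): 0.0, (0, 1): 0.0, (0, 2): 5.0, (0, 3): 5.0}
print(assign_goals(free, gain, robots=[(0, 0), (0, 3)]))
```

Because the first robot's choice zeroes nearby gain, the second robot is pushed toward a different region, which is the whole point of the coordination step.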
Summary of multi-robot exploration
- Simple: each robot greedily picks the best available goal and prohibits others from picking the same cell.
- Easily trapped in a local minimum (crossing paths).
- Improved coordination techniques enable robots to trade goals.
Example (figure slide; image not reproduced)
SLAM and exploration
- In SLAM we know neither the map nor the pose.
- Without knowledge about the pose, the integration of sensor information can lead to serious errors.
- A robot that focuses only on its pose does not move.
- The key idea: entropy decomposition!
Entropy decomposition
- Full SLAM posterior:
- This implies:
- The expectation is taken over the path posterior: the SLAM entropy is the sum of the path entropy and the expected entropy of the map.
Derivation of decomposition
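The equations on these slides were images and are not reproduced; the standard decomposition (in the notation of Probabilistic Robotics) reads:

```latex
p(x_{1:t}, m \mid z_{1:t}, u_{1:t})
  = p(x_{1:t} \mid z_{1:t}, u_{1:t})\; p(m \mid x_{1:t}, z_{1:t})

H\big[p(x_{1:t}, m)\big]
  = H\big[p(x_{1:t})\big]
  + E_{x_{1:t}}\!\Big[ H\big[p(m \mid x_{1:t})\big] \Big]
```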
Exploration in FastSLAM
- Based on grid-based FastSLAM (Ch. 13.10): the posterior is represented by a set of particles, each containing a robot path and an occupancy grid map.
- The FastSLAM exploration algorithm is a test-and-evaluate algorithm:
  - Propose a course of action for exploration.
  - Evaluate these actions by measuring the residual entropy.
  - Select the action that minimizes the resulting entropy.

FastSLAM summary
- The FastSLAM exploration algorithm extends the Monte Carlo exploration algorithm with two insights:
  - It applies to the full sequence of controls.
  - There are two types of entropies: one pertaining to the robot's path and one to the map.
Example of SLAM (sequence of figure slides; images not reproduced)
Reinforcement learning in robotics
- Reinforcement learning offers one of the most general frameworks for taking traditional robotics towards true autonomy and versatility.
- Successful in many well-defined, low-dimensional, discrete problems:
  - Backgammon (Tesauro 1994)
  - Elevator control (Crites & Barto 1996)
  - Helicopter control (Bagnell & Schneider 2001)
- Google Scholar: about 19,700 results for "reinforcement learning in robotics", of which 4,340 are recent articles (since 2003).
Recent advances (2003)
- Curse of dimensionality → hierarchical reinforcement learning.
- Temporal abstractions: decisions are not required at each step.
- Semi-Markov Decision Processes (SMDPs): a generalization of MDPs in which the time between one decision and the next is a random variable, real- or integer-valued. An SMDP:
  - allows the decision maker to choose actions whenever the system state changes
  - models the system evolution in continuous time
  - allows the time spent in a particular state to follow an arbitrary probability distribution
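As a concrete illustration (not from the slides), the commonly used SMDP Q-learning update discounts by the random duration τ of the decision step rather than by a single time step:

```latex
Q(s, a) \leftarrow Q(s, a)
  + \alpha \Big[ r + \gamma^{\tau} \max_{a'} Q(s', a') - Q(s, a) \Big]
```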
Reinforcement learning in RoboCup (http://www.robocup.org/)
- Keepaway = keepers vs. takers, at most 4 vs. 3.
- Large state space → SMDP formulation.
- In RoboCup, RL has also been used to:
  - kick the ball into the goal while avoiding an opponent
  - make a full team (11 players) learn collaborative passing and shooting (MC)
  - learn low-level skills (dribbling, passing, kicking)
RL for humanoid robots
- Applying RL to high-dimensional movement systems like humanoid robots remains an unsolved problem; greedy algorithms are likely to fail.
- Natural Actor-Critic (Peters et al. 2005): efficiently optimizes nonlinear motor primitives, based on the natural gradient formulation.
Summary
- Reinforcement learning has been applied successfully in many areas of robotics, but high dimensions are still a problem.
- Exploration is one application of RL: maximize the knowledge gained by the robot.
  - Active localization: seeking the pose, map known.
  - Mapping: pose known at all times.
  - SLAM: decomposition of entropy; map and pose unknown.
References
1. S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, Cambridge, MA, 2005.
2. Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. (See also Wikipedia, http://en.wikipedia.org/wiki/Reinforcement_learning)
3. Barto, A. G., and Mahadevan, S. (2003). Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems, 13(4):341-379.
4. Peter Stone, Richard S. Sutton, and Gregory Kuhlmann. Reinforcement Learning for RoboCup-Soccer Keepaway. Adaptive Behavior, 13(3):165-188, 2005.
5. Peters, J., Vijayakumar, S., and Schaal, S. (2003). Reinforcement learning for humanoid robotics. In: Humanoids2003, Third IEEE-RAS International Conference on Humanoid Robots, Karlsruhe, Germany, Sept. 29-30.
6. Maja J. Matarić. Reinforcement Learning in the Multi-Robot Domain. Autonomous Robots, 4(1):73-83, Mar 1997.
7. William D. Smart and Leslie Pack Kaelbling. Effective Reinforcement Learning for Mobile Robots. International Conference on Robotics and Automation, May 11-15, 2002.
8. Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215-219.
9. Crites, R. H., and Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8, pp. 1017-1023). Cambridge, MA: MIT Press.
10. Bagnell, J. A., and Schneider, J. (2001). Autonomous helicopter control using reinforcement learning policy search methods. In International Conference on Robotics and Automation (pp. 1615-1620). IEEE.
11. Jan Peters, Sethu Vijayakumar, and Stefan Schaal (2005). Natural Actor-Critic. In Proceedings of the 16th European Conference on Machine Learning (ECML 2005).
Exercise: