
Towards a Second Generation Random Walk Planner: An Experimental Exploration

Hootan Nakhost and Martin Müller
University of Alberta, Edmonton, Alberta, Canada

{nakhost,mmueller}@ualberta.ca

Abstract

Random walks have become a popular component of recent planning systems. The increased exploration is a valuable addition to more exploitative search methods such as Greedy Best First Search (GBFS). A number of successful planners which incorporate random walks have been built. The work presented here aims to exploit the experience gained from building those systems. It begins a systematic study of the design space and alternative choices for building such a system, and develops a new random walk planner from scratch, with careful experiments along the way. Four major insights are: 1. a high state evaluation frequency is usually superior to the endpoint-only evaluation used in earlier systems, 2. adjusting the restarting parameter according to the progress speed in the search space performs better than any fixed setting, 3. biasing the action selection towards preferred operators of only the current state is better than Monte Carlo Helpful Actions, which depend on the number of times an action has been a preferred operator in previous walks, and 4. even simple forms of random walk planning can compete with GBFS.

1 Introduction

The most common current technique for building satisficing planning systems is heuristic search [Bonet and Geffner, 2001]. In IPC-2011, it was used by 25 out of 27 planners in the deterministic satisficing track. Most of these planners use greedy search algorithms such as Greedy Best First Search (GBFS), weighted A* and Enforced Hill Climbing. These algorithms mainly exploit the heuristic and do not explore the search space much. This lack of exploration hurts performance in case of inaccurate heuristic values, which are very common in the automatically generated heuristic functions of domain independent planning. A search algorithm that is more robust with inaccurate or misleading heuristics is not only valuable, but essential to improve the state of the art. [Nakhost and Müller, 2009] introduced Monte Carlo Random Walks (MRW), an explorative, forward chaining local search method. MRW runs bounded length random walks (RW) to an endpoint which is evaluated by a heuristic h. After sampling, an endpoint with lowest h-value becomes the next state.

Previous systems that utilize RW include many Stochastic Local Search algorithms [Hoos and Stützle, 2004], Rapidly-Exploring Random Trees (RRT) in robot path planning [Alcázar et al., 2011], and the planners Identidem [Coles et al., 2007], Roamer [Lu et al., 2011], and Arvand [Nakhost and Müller, 2009; Nakhost et al., 2011; Xie et al., 2012; Nakhost et al., 2012; Valenzano et al., 2012]. The current study uses the experience from this first generation of RW planners to build Arvand-2013, a second generation system, from scratch. Each design step is justified by careful experiments, which also address important open questions about RW planning. Does it pay off to evaluate more of the generated states, and if so, which fraction? What are effective ways to control the length of walks and restarting? Is it possible to improve random walks by using preferred operators, beyond MHA [Nakhost and Müller, 2009]? Like most experimental papers on AI planning, experiments are run on recent IPC benchmarks.

2 Building a Second Generation RW Planner

2.1 The Experimental Framework

Like many current systems, the new planner Arvand-2013 is built on top of the Fast Downward (FD) code base [Helmert, 2006]. All tested algorithms share the FD implementation of successor generator, heuristic function, and representation of states and actions. Tests are run on all IPC-2011 domains on a 2.5 GHz machine with 4GB memory and 30 minutes per instance. Results for randomized planners are averaged over 5 runs, which is mandated by computational resource limits but already quite reliable, especially in cases of frequent restarts within each run. The main focus is on coverage, the number of problems solved.

2.2 Baseline Arvand-2013: a Simple RW Planner

Despite many recent experiments on RW planning, basics such as controlling the length of random walks and the restarting strategy have not been further explored since the original work of [Nakhost and Müller, 2009]. Does the effectiveness of a restarting strategy depend on the walk length distribution? Is there a robust strategy performing well across all or at least most planning domains? If not, is it possible to dynamically learn the most effective strategies?

The planner Arvand-2013 is built bottom-up, starting with the simple RW planner shown in Algorithm 1. The first experiment studies the effects of restarting and the length of walks in this planner. In the forward chaining local search, each run consists of one or more search episodes, and terminates when the current episode meets a termination condition. Episodes start from initial state s0 and perform a series of search steps until a restart condition or termination condition becomes true. Let hmin be the minimum heuristic value reached in the current episode. Each search step step_i starts from state s_{i-1} and ends in s_i with h(s_i) < hmin. In step_i, the planner runs a series of random walks to select s_i. When a random walk reaches a state s with h(s) < hmin, the algorithm immediately jumps there, setting s_i = s. The partial plan sequence of actions from s0 to s_i is tracked. The two termination conditions are:
1. reaching a goal state sG. In this case the planner returns the sequence of actions from s0 to sG.
2. exceeding a time limit. No plan is returned (details omitted from the code).

Like Arvand, local search restarts using a restart threshold tg, whenever tg successive walks fail to improve hmin. To begin a new episode, the current state is reset to s0.

Within a random walk (Algorithm 2), baseline Arvand-2013 evaluates all visited states, unlike Arvand. A walk stops early if it reaches a goal state, achieves an h-value below hmin, or hits a dead end; otherwise it continues until terminating as in the restarting random walks model of [Nakhost and Müller, 2012], with fixed probability rl at each step. rl is called the local restarting rate. In the absence of early stops, the length of walks is geometrically distributed with mean 1/rl. As the heuristic function, Arvand-2013 uses the cost-sensitive version of hFF [Hoffmann and Nebel, 2001] from the FD code base.

Algorithm 1 Monte Carlo Random Walk Planning
Input: Initial state s0, goal condition G, heuristic h
Output: Solution plan
Parameters: tg, rl

  c ← s0                              {current state}
  hmin ← h(c)
  loop
    s ← randomWalk(c, G, hmin, rl)
    if s ⊇ G then
      return plan from s0 to s
    else if s ≠ Deadend and h(s) < hmin then
      c ← s; hmin ← h(c)
    else if restart() then
      c ← s0; hmin ← h(c)
    end if
  end loop

Algorithm 2 Random Walk
Input: current state c, goal condition G, hmin
Output: sampled state s
Parameter: rl

  s ← c
  loop
    A ← applicableActions(s)
    if A = ∅ or h(s) = ∞ then
      return Deadend
    end if
    a ← uniformlyRandomSelectFrom(A)
    s ← apply(s, a)
    if h(s) < hmin or s ⊇ G then
      return s
    end if
    with probability rl: return s
  end loop
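To make the pseudocode concrete, the following is a minimal Python sketch of Algorithms 1 and 2. The helper names (is_goal, h, applicable_actions, apply_action) and the state representation are placeholders rather than the actual FD-based implementation; plan bookkeeping, the time limit, and caching of repeated h calls are omitted.

import random

DEADEND = object()  # sentinel returned by a walk that hits a dead end

def random_walk(c, is_goal, h, applicable_actions, apply_action, h_min, rl):
    """Algorithm 2: stop early on a goal, an improvement over h_min, or a
    dead end; otherwise terminate each step with probability rl."""
    s = c
    while True:
        actions = applicable_actions(s)
        if not actions or h(s) == float("inf"):
            return DEADEND
        s = apply_action(s, random.choice(actions))
        if h(s) < h_min or is_goal(s):
            return s
        if random.random() < rl:
            return s

def mrw_plan(s0, is_goal, h, applicable_actions, apply_action, tg=100, rl=0.01):
    """Algorithm 1: jump to any improving walk endpoint; restart from s0
    after tg successive non-improving walks."""
    c, h_min = s0, h(s0)
    failed_walks = 0
    while True:
        s = random_walk(c, is_goal, h, applicable_actions, apply_action, h_min, rl)
        if s is not DEADEND and is_goal(s):
            return s  # a full planner would return the action sequence from s0 to s
        if s is not DEADEND and h(s) < h_min:
            c, h_min = s, h(s)
            failed_walks = 0
        else:
            failed_walks += 1
            if failed_walks > tg:          # restart condition, as in restart()
                c, h_min = s0, h(s0)
                failed_walks = 0

Note that this baseline evaluates h at every visited state of a walk, a design choice that is revisited in Section 3.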

2.3 Parameters for Global and Local Restarts

The first experiment studies the coverage of Arvand-2013 as a function of the parameters tg and rl which control global and local restarts. Out of 14 IPC-2011 domains, ten with interesting results are shown in Figure 1. Not shown are BARMAN and TRANSPORT, where no configuration solved more than 10% of problems, and OPENSTACKS and PEGSOL, with more than 90% coverage in all configurations. Key observations are:

• This very basic RW planner already solves 126.6 of 280 IPC-2011 problems with the best tested fixed thresholds of tg = 100, rl = 0.01. Section 2.6 gives a more comprehensive comparison with other planners.

• No single setting performs well across all domains. Larger rl values leading to shorter walks perform better in NOMYSTERY and WOODWORKING, but worse in TIDYBOT and VISITALL. In ELEVATORS, restarting rarely with tg = 10000 increases the coverage compared with frequent restarting with tg = 100. In NOMYSTERY, FLOORTILE, PARCPRINTER and TIDYBOT, such frequent restarts are better.

• The two parameters rl and tg are mostly independent. Larger rl are always better in NOMYSTERY and WOODWORKING and always worse in TIDYBOT and VISITALL, independent of the choice of tg. Higher tg values are never worse in WOODWORKING or ELEVATORS, but detrimental in the other domains, independent of rl.

In light of these results, finding robust settings for tg and rl that work well across all domains seems infeasible. This is not surprising considering the widely varying characteristics of IPC-2011 domains. A parameter learning system can help to fully realize the potential of RW.

Figure 1: Coverage of BRW for rl ∈ {0.1, 0.01, 0.001}, tg ∈ {100, 1000, 10000}.

2.4 Adaptive Global Restarting

Why does Arvand-2013 with a small restarting threshold perform well in ELEVATORS but fail in FLOORTILE and NOMYSTERY? Is it possible to learn an effective restarting strategy on the fly? Figure 2 shows details of two typical examples, contrasting ELEVATORS and FLOORTILE. The graphs plot hmin as a function of the number of RW for tg = 1000 and tg = 10000. In FLOORTILE, hmin decreases very quickly at first, then stalls in a dead end or very large local minimum. Here, a large tg wastes lots of time exploring those dead ends and local minima, while restarting more often with a small tg increases exploration and the chance of reaching a goal. ELEVATORS shows the opposite behaviour: the plots show steady, slow progress towards hmin = 0. Fast restarts terminate the search before it can reach a goal.

Let Vw (V for velocity, w for walks) be the average heuristic improvement per walk, so on average, about h(s0)/Vw walks should reach h = 0. Algorithm 3, adaptive global restarting (AGR), adjusts tg by continually estimating Vw and setting tg = h(s0)/Vw. AGR initializes tg = 1000 and updates both tg and the estimated Vw after each episode. Before the i-th episode, AGR measures Vw^i, the average number of random walks to reach hmin, and sets Vw = avg_{j≤i} Vw^j.

Figure 3 compares AGR with restarting with fixed tg. While not always best, AGR is a robust, close to best choice in all domains.

Figure 2: Progress of hmin as a function of the number of random walks, for tg = 1000 and tg = 10000, in FLOORTILE-01 (top) and ELEVATORS-03 (bottom).

Algorithm 3 Monte Carlo Random Walks using AGR
Input: Initial state s0, goal condition G
Output: A solution plan
Parameters: tg, rl

  c ← s0; hmin ← h(s0)
  r ← 0; w ← 0                       {number of restarts; walks}
  li ← 0                             {last improving walk}
  Vw ← 0
  loop
    s ← RandomWalk(c, G, hmin, rl)   {sampled state}
    ++w
    if s ⊇ G then
      return plan reaching s
    else if s ≠ Deadend and h(s) < hmin then
      c ← s
      hmin ← h(s)
      li ← w
    else if w − li > tg then
      Vw^i ← (h(s0) − hmin)/li
      Vw ← (Vw^i − Vw)/r + Vw        {update estimate}
      tg ← h(s0)/Vw                  {update tg}
      c ← s0                         {restart from initial state}
      hmin ← h(s0)
      w ← 0; ++r
    end if
  end loop
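Below is a compact Python sketch of the AGR bookkeeping in Algorithm 3, reusing the random_walk helper and placeholder names from the earlier sketch. Instead of the incremental average in the pseudocode, it stores the per-episode velocities Vw^i in a list and averages them, which yields the same estimate; the guard for episodes that made no progress is a defensive addition not spelled out in the pseudocode.

def agr_plan(s0, is_goal, h, applicable_actions, apply_action, rl=0.01, tg0=1000):
    """Adaptive global restarting (AGR): after each restart-triggering episode,
    re-estimate the velocity Vw (average heuristic improvement per walk)
    and set tg = h(s0) / Vw."""
    h0 = h(s0)
    c, h_min, tg = s0, h0, tg0
    w, li = 0, 0                 # walks in the current episode, last improving walk
    episode_velocities = []      # one Vw^i per finished episode
    while True:
        s = random_walk(c, is_goal, h, applicable_actions, apply_action, h_min, rl)
        w += 1
        if s is not DEADEND and is_goal(s):
            return s
        if s is not DEADEND and h(s) < h_min:
            c, h_min, li = s, h(s), w
        elif w - li > tg:                              # episode exceeded the threshold
            if li > 0:                                 # episode made some progress
                episode_velocities.append((h0 - h_min) / li)          # Vw^i
                vw = sum(episode_velocities) / len(episode_velocities)
                tg = h0 / vw                           # expected walks to reach h = 0
            c, h_min, w, li = s0, h0, 0, 0             # restart from the initial state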



Figure 3: Coverage of AGR versus fixed threshold restarting with tg ∈ {100, 1000, 10000}, with fixed rl = 0.01.

With AGR and rl = 0.01, Arvand-2013 solves 149 of 280 problems, 22 more than the best tested fixed thresholds of tg = 100, rl = 0.01.

2.5 Adaptive Local Restarting

As for global restarting, an adaptive algorithm can improve local restarting. As motivation, Figure 4 plots hmin against the number of evaluated nodes in VISITALL-15 and ELEVATORS-05, for rl ∈ {0.1, 0.01, 0.001} and tg = 10000. Let Ve(r) (e for evaluations) be the average heuristic improvement per evaluation when rl = r. Larger Ve indicate faster progress towards a goal. In VISITALL, smaller rl settings achieve faster progress (larger Ve). The opposite happens in ELEVATORS.

Adaptive local restarting (ALR) is a multi-armed bandit method [Gittins et al., 2011] that estimates Ve(.) to learn the best rl. Before each random walk, ALR selects rl = ri from a candidate set C = {r1, . . . , rn}. Each ri can be considered one arm of the bandit. For each ri, ALR tracks the average number of evaluations avge(ri) and the average heuristic improvement avgh(ri), which is bounded below by 0. avgh(ri)/avge(ri) is used as an estimate for Ve(ri). ALR samples arms in an ε-greedy manner [Sutton and Barto, 1998]: an arm is selected uniformly at random with probability ε ≥ 0, and with probability 1 − ε, an arm with the largest estimated Ve(ri) is chosen.
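The following is a minimal Python sketch of the ε-greedy bandit behind ALR. The class name and bookkeeping details are assumptions; tracking sums rather than separate averages gives the same ratio avgh(ri)/avge(ri), and pulling each untried arm once before exploiting is an implementation choice not specified in the text.

import random

class ALR:
    """Epsilon-greedy bandit over candidate local restart rates r1, ..., rn,
    scoring each arm by estimated Ve(ri) = total improvement / total evaluations."""
    def __init__(self, candidates=(0.1, 0.01, 0.001), epsilon=0.1):
        self.candidates = list(candidates)
        self.epsilon = epsilon
        self.improvement = {r: 0.0 for r in self.candidates}
        self.evaluations = {r: 0.0 for r in self.candidates}
        self.pulls = {r: 0 for r in self.candidates}

    def select_rl(self):
        untried = [r for r in self.candidates if self.pulls[r] == 0]
        if untried:
            return random.choice(untried)            # try each arm at least once
        if random.random() < self.epsilon:
            return random.choice(self.candidates)    # explore uniformly
        return max(self.candidates, key=self.estimated_ve)   # exploit best arm

    def estimated_ve(self, r):
        return self.improvement[r] / max(self.evaluations[r], 1.0)

    def update(self, r, h_improvement, n_evaluations):
        """Call after each walk that was run with rl = r."""
        self.pulls[r] += 1
        self.improvement[r] += max(0.0, h_improvement)   # bounded below by 0
        self.evaluations[r] += n_evaluations

A planner would call select_rl() before each random walk and update() afterwards with the observed heuristic improvement and the number of evaluated states.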

Figure 5 compares ALR using ε ∈ {0.1, 1} with three fixed settings rl ∈ {0.1, 0.01, 0.001}. To ensure comparable results, the ALR candidate set is the same, C = {0.1, 0.01, 0.001}. In all configurations, AGR is used for global restarting. Key observations are:
• ALR performs robustly across all domains: the gap between ALR and the best fixed setting for a domain is never more than 10%, except in ELEVATORS where rl = 0.1 solves 15% more problems.
• Sampling based on Ve(.) gives a small advantage over uniform sampling: setting ε = 0.1 solves 6 (2%) more problems than uniform sampling with ε = 1.

Figure 4: hmin progress with the number of evaluated states, for rl ∈ {0.1, 0.01, 0.001}, in VISITALL-14 (top) and ELEVATORS-05 (bottom).

2.6 First Comparison with Systematic Search

This experiment studies how Arvand-2013 at this stage of development compares with GBFS, a popular systematic search planner. To keep the playing field level at this point, GBFS uses the same single heuristic hFF and no preferred operators. Figure 6 compares the coverage:
• Arvand-2013 and GBFS have very different strengths and weaknesses: Arvand-2013 on average solves 5 (25%) to 16 (80%) more problems in ELEVATORS, PARCPRINTER, and VISITALL; GBFS solves 7 (35%) to 15 (75%) more in BARMAN, PARKING, and SOKOBAN.
• Overall, Arvand-2013 is about level with GBFS, solving 5 (2%) more problems.

Figure 5: Coverage of ALR and local restarting with fixed rate, ε ∈ {0.1, 1}, rl ∈ {0.1, 0.01, 0.001}.

Figure 6: Coverage of simple versions of GBFS and Arvand-2013.

3 The Rate of Heuristic Evaluation

While heuristic state evaluations provide key information to guide search, computing a strong heuristic such as hFF is costly. Two methods which reduce the number of evaluations are deferred evaluation [Helmert, 2006], which uses the parent's evaluation for a node, and MRW [Nakhost and Müller, 2009], which evaluates only the endpoint of a random walk. The next experiment varies the frequency of state evaluations: instead of all states as in the baseline, Arvand-2013 evaluates 1. the endpoint of each random walk as in Arvand, and 2. intermediate states with probability peval. This interpolates between the baseline algorithm with peval = 1 and MRW with peval = 0. ALR with ε = 0.1 and AGR are used as in previous versions. Figure 7 shows the coverage in IPC-2011 when varying peval. Four categories of domains emerge:
• Domains where more evaluation always hurts: Arvand-2013 solves 100% of OPENSTACKS, VISITALL and WOODWORKING even with peval = 0. RW search is so effective that higher evaluation rates only increase the runtime. In TIDYBOT, peval = 0 is also best, but coverage is below 100%. Here, hFF is both very costly and misleading. For reference, even a blind random walk search, using only goal checks but no evaluation at all, can solve 90% of TIDYBOT instances! Only one planner in the IPC-2011 competition, BRT [Alcázar and Veloso, 2011], surpassed that number.

• Domains where more evaluation always pays off: PARCPRINTER and NOMYSTERY. Running time decreases with increasing peval, so spending more time on evaluation is worth it.

• An intermediate evaluation rate works best in ELEVATORS, FLOORTILE, and PARKING. peval = 0 provides too little information and peval = 1 is too slow.
• In SCANALYZER, PEGSOL, SOKOBAN, and TRANSPORT, the coverage is the same for all tested peval > 0.
These results challenge the previous practice of always setting peval = 0 as in MRW, and show that for a significant number of domains a higher evaluation rate is suitable.
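To make the interpolation concrete, here is one way the walk of Algorithm 2 can be modified (placeholder helpers as in the earlier sketches). The exact interaction between skipped evaluations, early stops and dead-end detection is an assumption, since states that are not evaluated cannot trigger either.

def random_walk_peval(c, is_goal, h, applicable_actions, apply_action,
                      h_min, rl, p_eval=0.5):
    """Evaluate intermediate states only with probability p_eval; p_eval = 1
    behaves like the baseline walk, p_eval = 0 like MRW's endpoint-only evaluation."""
    s = c
    while True:
        actions = applicable_actions(s)
        if not actions:
            return DEADEND
        s = apply_action(s, random.choice(actions))
        if is_goal(s):
            return s
        if random.random() < rl:                    # walk ends: evaluate the endpoint
            return DEADEND if h(s) == float("inf") else s
        if random.random() < p_eval:                # optional intermediate evaluation
            h_s = h(s)
            if h_s == float("inf"):
                return DEADEND
            if h_s < h_min:
                return s                            # early stop on improvement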

Figure 7: Coverage varying peval ∈ {0, 0.25, 0.5, 0.75, 1}.

4 Biased Action Selection for Random Walks

In the versions so far, Arvand-2013 used the heuristic computation only to obtain the h-value. Preferred operators can be obtained from common heuristic functions at no additional cost. Using them to guide random walks is the next step.

The GRIPPER example in [Nakhost and Müller, 2012] demonstrates that biased action selection can lower the regress factor, and greatly decrease the runtime of RW. Monte Carlo Helpful Actions (MHA) [Nakhost and Müller, 2009] bias action selection using information gathered from random walks. Q(a) scores are updated for each possible action a, and actions are sampled from a Gibbs distribution with temperature T. For current state s with applicable actions A(s), action a is chosen with probability

  p(a, s) = e^{Q(a)/T} / Σ_{b∈A(s)} e^{Q(b)/T}.

T determines the strength of the bias towards actions with larger scores: lowering T gives less uniform distributions.

In the MHA implementation of [Nakhost and Müller, 2009], Q(a) counts the number of times an action is preferred in the current search step. Q(a) is computed only from statistics gathered from endpoints of walks starting from the current state s. Preferred operators of intermediate states are not computed. Scores are reinitialized at every jump to another state. Preferred operators of s itself are not treated separately, which seems counterintuitive: they could at least be given a higher priority.

Arvand-2013 extends MHA to exploit the extra information from its more frequent state evaluations, and gives higher priority to current preferred operators, when known. Let n(a) be the number of times that a was a preferred operator, N = max_{a∈A(s)} n(a), and PO(s) the set of preferred operators in s. Then:

  Q(a) = N × W + n(a) × (1 − W)   if a ∈ PO(s)
  Q(a) = n(a)                      otherwise

The parameter W ∈ [0, 1] controls the relative weight of the operators in PO(s): larger values of W favor them more.



In states where PO(s) is not computed, and is therefore the empty set, the result is the same as in classical MHA.
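A small Python sketch of this biased action selection follows: Q(a) scores as defined above, then Gibbs sampling at temperature T. The function names are illustrative; n_counts would hold the per-action preferred-operator counts gathered during the current search step, and preferred the set PO(s) reported by the heuristic for the current state (empty when not computed, which reduces to classical MHA).

import math
import random

def mha_scores(actions, n_counts, preferred, W):
    """Q(a) = N*W + n(a)*(1 - W) for a in PO(s), and Q(a) = n(a) otherwise."""
    N = max((n_counts.get(a, 0) for a in actions), default=0)
    return {a: (N * W + n_counts.get(a, 0) * (1 - W)) if a in preferred
               else n_counts.get(a, 0)
            for a in actions}

def gibbs_select(actions, q, T):
    """Sample an action with probability proportional to exp(Q(a)/T)."""
    q_max = max(q[a] for a in actions)            # shift for numerical stability
    weights = [math.exp((q[a] - q_max) / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

With W = 1 and a nonempty PO(s), every current preferred operator receives the maximal score N, so low temperatures concentrate the walk on them, which matches the MHA(1, 10) configuration that performs best below.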

Figure 8 shows the coverage of MHA, varying the temperature T and weight W, against the version without MHA. In all runs, peval = 0.5, ALR(ε = 0.1) and AGR are used.

Let MHA(w, t) denote a version of Arvand-2013 as above enhanced with MHA, with W = w and T = t. Key observations are:

• MHA(1, 10) is very effective. Coverage in BARMAN improves from 0 to 18 (90%), in TRANSPORT from 2 (10%) to 20 (100%), in ELEVATORS from 7 (35%) to 20 (100%) and in PARKING from 9 (45%) to 17 (85%). In total, MHA(1, 10) solves 56 more problems (20%).

• W = 1, using only the current preferred operators when available, works best. MHA(1, t) consistently outperforms MHA(0.5, t), which outperforms MHA(0, t).

• With W = 1, for most domains lower temperatures of T = 10 and T = 100 are preferable. Exceptions are PARCPRINTER with T = 1 and TIDYBOT with T = 1000.

5 Comparison with Other Planners

Table 1 shows total coverage in IPC-2011 for the successive versions of Arvand-2013, compared with the top three competition planners in terms of coverage, LAMA2011, FDSS2 [Helmert et al., 2011] and Probe [Lipovetzky and Geffner, 2011], as well as the RW planner Roamer. The last version of Arvand-2013, using MHA(1, 10) and peval = 1, is very competitive. Overall, it only lags behind LAMA2011. This is mainly due to the results in SOKOBAN: Arvand-2013 solves 17 fewer problems in this domain, which is considered to be hopeless for RW search [Xie et al., 2012].

Arvand-2013 version      solved  |  Reference planner  solved
Baseline                    127  |  Roamer                215
+AGR                        149  |  FDSS2                 220
+ALR, peval = 0.5           164  |  Probe                 226
+ALR, peval = 1             156  |  LAMA2011              250
+MHA, peval = 0.5           219  |
+MHA, peval = 1             226  |

Table 1: Number of solved tasks out of 280 in IPC-2011. Left: Arvand-2013 versions. Right: IPC reference planners.

6 Conclusions and Future Work

The systematic bottom-up reconstruction of the new RW planner Arvand-2013 challenges several assumptions and design choices made in previous systems, and shows that strong improvements to RW systems are still possible. Two observations stand out:

1. The importance of adaptive systems: this becomes more important for search algorithms like RW search, which, instead of systematically exploring all states, selectively sample parts of the search space; the effective distribution of samples depends on the search space characteristics of the input problem. ALR and AGR provide practical guidelines for developing such adaptive systems.

Figure 8: Coverage of MHA versus uniform action selection (no MHA), with T ∈ {1, 10, 100, 1000} and w = 0 (top), w = 0.5 (middle), w = 1 (bottom).

2. The big effect of action selection biasing: as the theory developed in [Nakhost and Müller, 2012] predicts and the experiments in Section 4 confirm, action selection biasing can significantly improve the performance of RW search. MHA is one successful example of a biasing technique.

The latest version of Arvand-2013 is still relatively simple, with much room for adding features, but already has strong performance. Future work includes investigating heuristic functions other than hFF, using multiple heuristics in RW planning, and tuning for plan quality as opposed to coverage.



References

[Alcázar and Veloso, 2011] V. Alcázar and M. Veloso. BRT: Biased rapidly-exploring tree. In The Seventh International Planning Competition, IPC 2011, Universidad Carlos III de Madrid, pages 17–20, 2011.
[Alcázar et al., 2011] V. Alcázar, M. Veloso, and D. Borrajo. Adapting a rapidly-exploring random tree for automated planning. In Proceedings of the Fourth Annual Symposium on Combinatorial Search, SOCS 2011, Barcelona, Spain, 2011.
[Bonet and Geffner, 2001] B. Bonet and H. Geffner. Planning as heuristic search. Artificial Intelligence, 129(1-2):5–33, 2001.
[Coles et al., 2007] A. Coles, M. Fox, and A. Smith. A new local-search algorithm for forward-chaining planning. In Proceedings of the Seventeenth International Conference on Automated Planning and Scheduling, ICAPS 2007, Providence, Rhode Island, USA, pages 89–96, 2007.
[Gittins et al., 2011] J. Gittins, K. Glazebrook, and R. Weber. Multi-armed Bandit Allocation Indices. Wiley, 2011.
[Helmert et al., 2011] M. Helmert, G. Röger, and E. Karpas. Fast Downward Stone Soup: A baseline for building planner portfolios. In ICAPS 2011 Workshop on Planning and Learning, pages 28–35, 2011.
[Helmert, 2006] M. Helmert. The Fast Downward planning system. Journal of Artificial Intelligence Research (JAIR), 26:191–246, 2006.
[Hoffmann and Nebel, 2001] J. Hoffmann and B. Nebel. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253–302, 2001.
[Hoos and Stützle, 2004] H. Hoos and T. Stützle. Stochastic Local Search: Foundations & Applications. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2004.
[Lipovetzky and Geffner, 2011] N. Lipovetzky and H. Geffner. Searching for plans with carefully designed probes. In Proceedings of the 21st International Conference on Automated Planning and Scheduling, ICAPS 2011, Freiburg, Germany, 2011.
[Lu et al., 2011] Q. Lu, Y. Xu, R. Huang, and Y. Chen. The Roamer planner: Random-walk assisted best-first search. In The Seventh International Planning Competition, IPC 2011, Universidad Carlos III de Madrid, pages 73–76, 2011.
[Nakhost and Müller, 2009] H. Nakhost and M. Müller. Monte-Carlo exploration for deterministic planning. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI 2009, Pasadena, California, USA, pages 1766–1771, 2009.
[Nakhost and Müller, 2012] H. Nakhost and M. Müller. A theoretical framework for studying random walk planning. In Proceedings of the Fifth Annual Symposium on Combinatorial Search, SOCS 2012, Niagara Falls, Canada, 2012.
[Nakhost et al., 2011] H. Nakhost, M. Müller, R. Valenzano, and F. Xie. Arvand: The art of random walks. In The Seventh International Planning Competition, IPC 2011, Universidad Carlos III de Madrid, pages 15–16, 2011.
[Nakhost et al., 2012] H. Nakhost, J. Hoffmann, and M. Müller. Resource-constrained planning: A Monte Carlo random walk approach. In Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling, ICAPS 2012, Atibaia, São Paulo, Brazil, pages 181–189, 2012.
[Sutton and Barto, 1998] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[Valenzano et al., 2012] R. Valenzano, H. Nakhost, M. Müller, J. Schaeffer, and N. Sturtevant. Arvand-Herd: Parallel planning with a portfolio. In Proceedings of the 20th European Conference on Artificial Intelligence, ECAI 2012, Montpellier, France, pages 113–116, 2012.
[Xie et al., 2012] F. Xie, H. Nakhost, and M. Müller. Planning via random walk-driven local search. In Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling, ICAPS 2012, Atibaia, São Paulo, Brazil, pages 315–322, 2012.
