

DeJong, G., AI Can Rival Control Theory for Goal Achievement in a Challenging Dynamical System.

Computational Intelligence, 1999. 15(4): p. 333-366.

AI can Rival Control Theory for Goal Achievement in Some Difficult Dynamical Systems

Gerald DeJong

Computer Science Dept. and the Beckman Institute

University of Illinois, 405 N. Mathews, Urbana, IL 61801

Fax: (217) 244-8371 email: [email protected]

Abstract

Nonlinear dynamical systems are notoriously difficult to control. The ACROBOT is an under-actuated double pendulum in a gravitational field. Under most driving schemes the ACROBOT exhibits chaotic behavior. But with careful applications of energy it is possible to gradually pump the system so as to swing it over its supporting joint. This swing-up task is of current interest to control theory researchers.

Conventional notions of AI planning are not easily extended to domains with interacting continuously-varying quantities. Such continuous domains are often dynamic; important properties change over time even when no action is taken. Noise and error propagation can preclude accurately characterizing the effects of actions or predicting the trajectory of an undisturbed system through time. A plan must be a conditional action policy or a control strategy that carefully nudges the system as it strays from a desired course. Automatically generating such plans or action strategies is the subject of this research.

An AI system successfully learns to perform the swing-up task using an approach called Explanation-Based Control (EBC). The approach combines a plausible qualitative domain theory with empirical observation. Results are in some respects superior to the known control theory strategies. Of particular importance to AI is EBC's notion of a "plan" or "strategy" and its method for automatic synthesis. Experimental evidence confirms EBC's ability and generality.

Keywords: Explanation-Based Learning, Qualitative Reasoning, Intelligent Control, Double Pendulum, Problem Solving, Dynamical System Control


Introduction

Conventional approaches to AI planning have difficulty with simultaneous and continuous changes of interacting quantities. Yet such changes are central to many real-world domains. We have advanced Explanation-Based Control (EBC) (DeJong 1994) as a paradigm for automated planning and goal achievement in such domains. Our previous work exercised the approach only on relatively simple tasks. In this paper we demonstrate that the EBC approach applies to a particular complex dynamical system of current interest to control theory researchers. In addition, we show that EBC yields solutions that are comparable and in some respects superior to control theory solutions.

Unlike conventional AI planning, EBC includes machine learning and quantitative estimation as crucial components. Furthermore, EBC is unlike conventional control theory in that it relies heavily on inference and symbolic representations. In these differences lies its strength. Some tasks that are difficult to address in the other paradigms can be handled quite naturally in EBC. However, EBC complements rather than supplants the other paradigms. For a domain that can be accurately modeled with linear differential equations, for example, conventional control theory is the paradigm of choice; the added complications of EBC are not warranted.

We examine the swing-up task of an underactuated double pendulum in a gravitational field. This is one of the standard classical mechanisms in control theory, exhibiting a broad range of behaviors. Some linearizable modes are relatively easy to control while others involve intrinsic nonlinearities and are extremely difficult. The driven double pendulum is perhaps the simplest mechanism that exhibits chaotic behavior.

The system is illustrated in Figure 1. The robot, affectionately known as the ACROBOT, consists of two planar links. A torque can be applied at the second joint, q2, but the first joint, q1, is free-swinging. The task is to find a torque strategy for q2 that excites increasingly energetic swings, finally resulting in the ACROBOT swinging over its undriven q1 joint. Until recently, this was an open problem in control theory. There are now several known control theory solutions (Spong 1994). We show that among these approaches, the EBC solution exhibits the best time behavior and greatest tolerance for unmodeled dynamics.

Control theory has met with a great deal of success in real-world applications. EBC is not proposed as a replacement for these standard applications. However, the tools of conventional control theory do not easily extend to ill-defined and poorly understood problems. EBC and the AI paradigm of symbolic reasoning hold great promise for constructing control strategies for certain difficult-to-analyze complex dynamical systems. While the EBC system described in this paper is autonomous, the EBC approach also holds promise for a human/computer collaborative paradigm for control design.

Controlling Dynamical Systems

A dynamical system is composed of a number of interacting world components. Such a system is modeled as a set of real-valued quantities. The mathematical interactions among the model's quantities mirror the world's causal interactions. The controls are world items that can be directly manipulated. We call their corresponding quantities "control inputs". The values of these quantities can be changed at will. Other inputs are exogenous; their values are provided externally. The remaining quantities react as dictated by causal couplings of the world.

A controller is a mediator between a user and the target dynamical system to be controlled. Target systems that are difficult to handle by themselves can often be made easier to deal with by designing an appropriate controller. Any user instructions can be communicated to the controller in the form of a command input. The controller selects the actual input values to be applied to the target system. For example, an automobile's anti-lock braking system is a controller. It selects and applies the actual hydraulic brake pressure in response to the command input, the human's foot pressure on the brake pedal. The controller can achieve much more effective braking than can a typical unaided human driver. Another example is the fly-by-wire controller in most modern jet aircraft. Here the pilot's commands are automatically combined by the controller with data encoding the current altitude, air speed, barometric pressure, etc., to yield a more efficient and effective manipulation of the flight surfaces. Controllers can accomplish superhuman feats. For example, a smoother flight can be achieved in a large airliner by a controller that senses turbulence at the nose and, without pilot intervention, deflects flight surfaces to minimize the bumps when the wings enter the disturbed air.

Control theory is the science of constructing such controllers. The most powerful control theory methods work forward from a differential equation model of the target system. Conventional control theory can be viewed as exploiting a great deal of prior knowledge about the target system in the form of these differential equations. Differential equations (like those of Table 1) precisely describe a system's behavior over time. For a large and useful class of mechanisms, control theory techniques allow one to construct a controller. These techniques are far from automatic but they guide the control engineer through the most important decisions. Furthermore, they can often guarantee that the target system will possess some desired properties such as tracking changes in the command input quickly, exhibiting small system overshoot to changing conditions, quickly dampening oscillations introduced by transients, and so on.

However, control theory faces several limitations. Many physical processes within a target system are not sufficiently well understood to support complete modeling with differential equations. Common phenomena like the deformation of solids, the flow of less-than-ideal fluids, and even friction are imperfectly understood and can only be approximated by current quantitative models.

Furthermore, the guarantees of behavior offered by control theory apply only to the differential equation models and not to the actual physical systems that they describe. If the differential equations represent an approximation or simplification of the physical system (and this is almost always the case), there is no formal guarantee of good behavior for the real-world target system. The equations in Table 1, for example, include no frictional losses, no air resistance, no bending of the links, and no interaction of the system's movement with the spinning of the Earth. Nevertheless, these phenomena are unavoidable in the real world and often quite significant. Such simplifications are the differential equation analog of the well-known qualification problem in predicate logic (McCarthy 1980).

Even when an accurate differential equation model is available, the form of the equations may preclude the use of conventional control theory techniques. These techniques have been most successful at controlling linear systems. Thus, the differential equation model must often be linearized before the control techniques can be applied. For many intrinsically nonlinear systems, such linearization is not possible. Even in cases where an approximating linearization can be constructed, the approximation process further magnifies the difference between the represented model and its real-world counterpart. This discrepancy exacerbates the qualification problem mentioned above and further distances the real world system from the formal guarantees of the model.

Explanation-Based Control

EBC avoids the quantitative inflexibility of differential equations. Our approach employs background knowledge of a more abstract form. The prior knowledge for EBC consists of a set of plausible symbolic statements about the target system using a qualitative vocabulary (Forbus 1984; Bobrow 1985; Kuipers 1986). It is easier and perhaps more natural for the user to specify qualitative relations (e.g., that an automobile's speed tends to decrease when one pushes on the brake pedal) than to commit to a specific differential equation model describing the behavior under all conditions.

Of course, qualitative axioms are considerably less informative than a differential equation model would be. Only more general qualitative features can be inferred. In EBC the inferential process is further weakened by considering the qualitative axioms as only plausible. The syntax adopted most closely follows (Forbus 1984), but the semantic interpretation is unique (compare (DeJong 1994) with (Crawford, Farquhar et al. 1990)). The plausibility of the axioms means that inferred behavior is not entailed but only suggested. Since the underlying dynamical system may be extremely complex and subtle, we assume that it cannot be perfectly described even qualitatively.

EBC, like other explanation-based approaches, must be given one or more examples of the target system's desired behavior. These examples are qualitatively explained using EBC's background domain theory to suggest plausible qualitative control strategies in an explanation-based fashion (DeJong 1997). The newly constructed qualitative strategy is quantitatively tuned and then either accepted or rejected by empirically monitoring the actual target system's behavior in response to the strategy. In this way, EBC's prior knowledge and observed examples are combined to synthesize a control strategy.

The mechanism for constructing explanations follows standard and sound logical inference, though the inference procedure cannot be of a refutational type. The qualitative characterization of the input quantities together with the qualitative statements of the prior knowledge can be thought of as a kind of axiom set. The observed qualitative characterization of the output behavior is the "theorem" to be proven. Any derivation of the observed behavior from the axioms constitutes a plausible explanation. Each distinct explanation for an observed behavior advances a different underlying causal conjecture of how the inputs resulted in the observed outcome. Since the axioms are only plausible, the explanation (though entailed by the axioms) might not hold. Indeed, it is possible to explain many behaviors that the real world could never exhibit.

The form of an explanation is a causally annotated sequence of qualitative states through which the target system is hypothesized to pass. Each state is qualitatively homogeneous; within such a state all relevant qualitative relations persist. A state transition is signified by any change in a relevant qualitative relation (a proposition's beginning, ending, or changing truth value). This view is quite conventional in qualitative reasoning (Weld and deKleer 1990).

Each distinct explanation embodies a causally distinct account and corresponds to a distinct method of exerting control over the target system. Thus, the set of explanations or derivable behaviors defines a space of possible qualitative control approaches. This space is searched empirically: The most plausible hypothesis that qualitatively accounts for all of the known real-world data is selected first. For us, the most plausible hypothesis is the one with the simplest explanation -- the least complex plausible derivation. A quantitative control strategy is constructed for the selected hypothesis. The strategy is applied to the target system which responds in its real-world way resulting in additional observations. If these new data are qualitatively inconsistent with the current explanation, that explanation is abandoned. The next most plausible explanation is the simplest one consistent with the observed world behavior. This next qualitative explanation gives rise to an alternative quantitative control strategy and so on. Rickel and Porter (Rickel and Porter 1997) established that qualitative explanations can be generated effectively in simplest-first order. Testing a qualitative explanation for consistency with observations is also efficient. The process of generating and testing control strategies repeats until either a) some control strategy empirically proves itself to deliver acceptable control, or b) the hypothesis space is exhausted. In the latter case, the mechanism is uncontrollable by an EBC strategy; the system's prior domain theory is simply inadequate to explain the requisite range of world behaviors.
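
The generate-and-test cycle just described can be summarized procedurally. The following Python sketch is purely illustrative: the callables it accepts (explanation generation, calibration, consistency testing, and so on) stand in for the components named above and are not the implementation used in this work.

```python
from typing import Any, Callable, List, Optional

def ebc_search(next_explanation: Callable[[List[Any]], Optional[Any]],
               calibrate: Callable[[Any, List[Any]], Any],
               run_system: Callable[[Any], Any],
               consistent: Callable[[Any, Any], bool],
               achieved: Callable[[Any], bool],
               observations: List[Any],
               max_hypotheses: int = 100) -> Optional[Any]:
    """Empirically search the space of plausible qualitative explanations,
    simplest first, until one yields an empirically acceptable strategy."""
    for _ in range(max_hypotheses):
        # Simplest plausible derivation consistent with all data seen so far.
        explanation = next_explanation(observations)
        if explanation is None:
            return None  # hypothesis space exhausted: domain theory inadequate
        strategy = calibrate(explanation, observations)  # quantitative tuning
        trace = run_system(strategy)                     # exercise the target system
        observations.append(trace)
        if consistent(explanation, trace) and achieved(trace):
            return strategy                              # empirically confirmed
    return None
```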

The ACROBOT

To control the ACROBOT, an input torque can be applied to the elbow joint (q2) but not the shoulder joint (q1), which is free-swinging. Any rotation at q1 is entirely the result of dynamic coupling from the rest of the system. Thus, the single input quantity for this system is the torque at q2. In addition to torque input, our simulator permits velocity and position control of the elbow joint (q2). With velocity input, the torque is computed by a simple first-order PD controller. With position input a simple second-order PD controller is used.
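
The velocity-command interface can be pictured with a small sketch. The gains and torque limit below are hypothetical placeholders; the actual simulator values are not given here.

```python
def velocity_pd_torque(dq2, ddq2, dq2_desired, kp=10.0, kd=1.0, tau_max=5.0):
    """First-order PD servo for velocity commands: torque from the
    proportional and derivative errors between current and desired elbow
    velocity, clipped to the maximum the (simulated) motor can deliver."""
    err = dq2_desired - dq2
    derr = -ddq2  # the desired velocity is piecewise constant
    tau = kp * err + kd * derr
    return max(-tau_max, min(tau_max, tau))
```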

Figure 1: The Acrobot

Link 1 is of length l1, mass m1, center of mass lc1, and inertia I1; similarly for link 2. The equations of motion of the system, adapted from (Spong and Vidyasagar 1989), are given in Table 1.


Some aspects of the ACROBOT system are well understood. Balancing vertically at q1 = 0, q2 = 0, while more challenging than the conventional pole-balancing task, is comparatively easy. Designing such a controller is studied in most first-year graduate control theory courses. Other aspects of the ACROBOT are more subtle. For example, the point q1 = 0, q2 = 0 is not the only unstable equilibrium point. The condition of unstable equilibrium is that the system's center of mass be controllably directly above the q1 pivot. The set of these q1, q2 angle values forms a manifold. Maintaining balance while moving along the unstable equilibrium manifold was an open problem until 1992 (Bortoff and Spong 1992). Other aspects of the system are still more challenging. Some are difficult even to pose in the traditional control theoretic framework of regulation or tracking.

d_{11}\ddot{q}_1 + d_{12}\ddot{q}_2 + h_1 + \phi_1 = 0
d_{12}\ddot{q}_1 + d_{22}\ddot{q}_2 + h_2 + \phi_2 = \tau

where

d_{11} = m_1 l_{c1}^2 + m_2 (l_1^2 + l_{c2}^2 + 2 l_1 l_{c2} \cos q_2) + I_1 + I_2
d_{22} = m_2 l_{c2}^2 + I_2
d_{12} = m_2 (l_{c2}^2 + l_1 l_{c2} \cos q_2) + I_2
h_1 = -m_2 l_1 l_{c2} \sin(q_2)\,\dot{q}_2^2 - 2 m_2 l_1 l_{c2} \sin(q_2)\,\dot{q}_2 \dot{q}_1
h_2 = m_2 l_1 l_{c2} \sin(q_2)\,\dot{q}_1^2
\phi_1 = (m_1 l_{c1} + m_2 l_1) g \sin(q_1) + m_2 l_{c2} g \sin(q_1 + q_2)
\phi_2 = m_2 l_{c2} g \sin(q_1 + q_2)

Table 1: The Acrobot Equations of Motion
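
For concreteness, the equations of Table 1 can be integrated numerically as follows. This is a minimal sketch with illustrative parameter values (not those of our simulator, which also models friction) and a simple fixed-step Euler integrator.

```python
import numpy as np

# Illustrative parameters (not the paper's values).
m1 = m2 = 1.0; l1 = 1.0; lc1 = lc2 = 0.5; I1 = I2 = 0.2; g = 9.8

def acrobot_accel(q1, q2, dq1, dq2, tau):
    """Solve the two equations of Table 1 for the joint accelerations."""
    d11 = m1*lc1**2 + m2*(l1**2 + lc2**2 + 2*l1*lc2*np.cos(q2)) + I1 + I2
    d12 = m2*(lc2**2 + l1*lc2*np.cos(q2)) + I2
    d22 = m2*lc2**2 + I2
    h1 = -m2*l1*lc2*np.sin(q2)*dq2**2 - 2*m2*l1*lc2*np.sin(q2)*dq1*dq2
    h2 = m2*l1*lc2*np.sin(q2)*dq1**2
    phi1 = (m1*lc1 + m2*l1)*g*np.sin(q1) + m2*lc2*g*np.sin(q1 + q2)
    phi2 = m2*lc2*g*np.sin(q1 + q2)
    D = np.array([[d11, d12], [d12, d22]])
    return np.linalg.solve(D, np.array([-h1 - phi1, tau - h2 - phi2]))

def step(state, tau, dt=0.002):
    """One fixed-step Euler update of (q1, q2, dq1, dq2)."""
    q1, q2, dq1, dq2 = state
    ddq1, ddq2 = acrobot_accel(q1, q2, dq1, dq2, tau)
    return (q1 + dt*dq1, q2 + dt*dq2, dq1 + dt*ddq1, dq2 + dt*ddq2)
```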

One less-well-understood double pendulum problem is the swing-up control of the ACROBOT. Here, the task is to pump energy into the system. A strategy must be found for driving q2, the controllable joint, so as to excite oscillation of q1. The goal is to swing the angle qcm past the vertical (see Figure 2).

The swing-up task is difficult for several reasons. First, any solution involves periodic driving that switches between different control movements. Such periodic excitation often leads to uncontrollable behavior. Indeed, in the ACROBOT system most periodic excitation of q2 leads quickly to a chaotic regime from which control would be difficult or impossible to establish. Second, any solution requires excursions well into a highly non-linear portion of the system's phase space. Small angle swings result in the well-known constant period oscillation. But as energy is pumped into the ACROBOT, qcm undergoes wider swings. If the maximum deflection of qcm approaches 0, the period of oscillation approaches infinity. Such nonlinearity is notoriously difficult to handle. Third, with wide oscillations come high velocities as the system's energy is shuttled from potential to kinetic and back. High velocities result in high frictional losses.

Figure 2: The Swing-up Goal


Figure 3: HC Solution Trace (input to the EBC system)

Controlling the ACROBOT

The first solution to the swing-up problem, called HC for Heuristic Control, works by analogy to a child pumping a swing (Spong 1995). The basic idea is to drive the second link so as to inject energy "in phase" with the swinging of the first link. As q̇1 approaches 0 from the positive side (i.e., q1 reaches a positive extreme), the control input, q̇2-desired, is set to a negative constant. Thus, q2 is driven clockwise as q1 ends its counterclockwise swing. This control input is continued until q2 reaches some maximum deflection angle -α. The control input is then set to 0 until q1 reaches its maximum clockwise deflection (q̇1 approaches 0 from the negative side). At this point q̇2-desired is set to the positive constant until q2 achieves a deflection of +α and the process repeats.
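
The HC switching rule can be rendered as a small state machine, sketched below. The commanded velocity v_cmd and the deflection limit alpha are hypothetical placeholders; (Spong 1995) gives the actual strategy.

```python
def hc_command(dq1, q2, mode, v_cmd=2.0, alpha=1.0):
    """One step of Heuristic Control: drive q2 'in phase' with link 1.
    Returns (desired q2 velocity, next mode); mode starts at 'coast-ccw'."""
    if mode == "coast-ccw" and dq1 <= 0.0:     # q1 at its counterclockwise extreme
        mode = "drive-cw"                      # command a negative constant velocity
    elif mode == "drive-cw" and q2 <= -alpha:  # elbow reached -alpha: coast
        mode = "coast-cw"
    elif mode == "coast-cw" and dq1 >= 0.0:    # q1 at its clockwise extreme
        mode = "drive-ccw"                     # command a positive constant velocity
    elif mode == "drive-ccw" and q2 >= alpha:  # elbow reached +alpha: coast
        mode = "coast-ccw"
    cmd = {"drive-cw": -v_cmd, "drive-ccw": +v_cmd}.get(mode, 0.0)
    return cmd, mode
```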

Figure 4: Final EBC Solution Trace for the Problem of Figure 3

Figures 3 and 4 are screen dumps of captured traces from the simulated ACROBOT. They show the behavior of some of the relevant variables as a function of time. The system variables illustrated include the two ACROBOT angles (labeled Q1 and Q2), their angular velocities (labeled Q1-DOT and Q2-DOT), the computed angle from the shoulder to the center of mass (CM-Q) and its angular velocity (CM-Q-DOT), the computed distance from the shoulder to the system center of mass (CM-L), its time rate of change (CM-L-DOT), the composite energy of the system (ENERGY), which is decomposed into potential (POT-E) and kinetic (KIN-E), and the system clock showing a constant increase in time. The goal is reached when the ACROBOT's center of mass is directly above the q1 pivot point, that is, when qcm = 2πi for any integer i. In both Figures, the amplitude of qcm (CM-Q) increases until the goal is reached. The goal achievement point is circled in the Figures and corresponds to a normalization in the CM-Q output display from 0 to 2π. In both cases the goal was achieved during a clockwise swing. Achieving the goal during a counter-clockwise swing would be signified by a downward CM-Q normalization step (from 2π to 0).

Figure 3 shows the captured HC solution. It consists of 1300 clock ticks. The solution involves seven control actions alternately driving the velocity of q2 (Q2-DOT) positive and negative for prescribed intervals. The system's energy is shuttled between potential and kinetic as the double pendulum swings; the total energy gradually increases until the goal is achieved. For comparison, Figure 4 provides EBC's solution to the same problem. The goal is achieved in a bit less than half the time with five control actions. The construction of the EBC strategy will be discussed in a moment. First we turn our attention to the HC strategy.

There is no formal proof that the HC strategy works. Whether such a proof is possible is an open question. In the terminology of control theory it is heuristic due to the lack of any guarantee that it will achieve its goal of swinging up the ACROBOT. The difficulty in analysis stems from discontinuities in the control manipulation. In practice, the HC strategy seems to work well under a broad range of initial configurations and is quite robust with respect to small changes in simulator parameters.

HC is also heuristic from the point of view of AI planning. It was not produced automatically nor according to any formal reasoning procedure. One can, however, offer a number of informal justifications for why the HC strategy achieves the goal. Indeed, the automatic construction of such a plausible justification is at the heart of the EBC strategy.

EBC's background domain theory is given in Table 2. The conventional inference rules, R5-R7, are simply definitions of notation. Rule R5, for example, says that a quantity is increasing if and only if its time derivative is positive. Rule R7 defines a quantity qscm (which we use to denote the angular speed of the center of mass about the shoulder joint) as the absolute value of its angular velocity. The meaning of these sentences is conventional; they are each double implications that are always respected by the world.

The plausible processes and plausible inference rules are more interesting. Each process has an initiation condition (IC), a maintenance condition (MC), and a body. To be active, the initiation and maintenance conditions must be simultaneously true. The process is then active and independent of its IC for the interval over which the maintenance condition persists. During that time interval the system is free to plausibly believe any item from the body of the process. It may also ignore any item from the body. If no item from the body is employed in a qualitative explanation then that process is irrelevant to the explanation. If any item contributes to an explanation then its process is relevant to the explanation.

There are ten processes in all. The first four processes describe the qualitative behavior of the ACROBOT's center of mass as acted upon by the gravitational force. The SUR process is typical of these. It describes the ACROBOT's center of mass Swinging Up on the Right (i.e., rotating counterclockwise against gravity) as its kinetic energy is transformed to potential energy. The initiation condition for this process is that the center of mass be directly below the shoulder joint (qcm = π). The maintenance condition requires that the angular velocity of the center of mass be positive. This means that the center of mass must be rotating counterclockwise around the shoulder joint. While the counterclockwise rotation continues, the system may make use of the statements in the body of SUR to justify desired conclusions. The body contains three statements. The first suggests that the angular velocity is decreasing. The second suggests that the angle to the center of mass is increasing. The third indicates that the expected duration of the process is positively influenced by the initial angular speed of the center of mass. In other words, it is plausible to believe under the conditions of IC and MC that the rotational movement will continue but at an ever decreasing rotational velocity. When and if the rotation slows to a stop, the process's maintenance condition becomes violated and its body's statements are no longer plausibly supported by the process.

Four additional processes describe the extension and retraction of the center of mass as q2 is rotated. For example, the process Lengthen-ccw suggests that so long as q2 is negative but increasing (i.e., the elbow joint is bent in a clockwise direction but is rotating counterclockwise so as to decrease the bending angle), the distance from the shoulder pivot to the system's center of mass is increasing. Note that the initiation and maintenance conditions are the same. There are four statements in the body. The first says that the distance from the shoulder pivot to the center of mass is increasing. The second, more interesting statement, says that the energy of the ACROBOT system may plausibly be decreasing. There are several ways to think about why this is so. First, we might consider the forces at the q2 joint: If the ACROBOT is revolving about its shoulder joint, centrifugal force tends to increase the distance from the center of rotation (q1) to the system's center of mass. The torque at q2 must resist this force for the q2 angle to remain constant. If the torque at q2 is such that the arm is straightening in a controlled way, that torque is opposite in sign to the motion of q2. Thus, the work done is negative and the energy level of the system is decreasing. Alternatively, to get a more intuitive understanding, consider twirling a weight around one's head at the end of a string. As the weight circles faster and faster one must grasp the string more tightly to keep it from flying away. If one allows the string to be paid out, work could be done, perhaps heating up one's hands from the friction of the sliding string. This work is energy that used to be stored in the rotating system. Thus, as the center of mass (the weight) gets farther away, the rotating system possesses less energy. The final two statements in the body say that energy is lost more quickly as the angle q2 changes more quickly and also as the speed of the system's rotation increases.

The final two processes are control processes. They specify what can be plausibly inferred when control commands are issued. One process corresponds to commanding q2 to rotate counterclockwise. The other is for clockwise rotation. HC, the expert's training solution, employs velocity commands. To explain this behavior the control processes are in terms of desired velocity commands to the q2 joint. The commands are serviced by a simple PD controller. The desired torque is computed from the proportional and derivative errors between the current and desired angular velocity. The torque applied in the simulator is the minimum between this torque and the maximum that the simulated motor is capable of delivering. The only condition needed to activate, for example, q2-ccw is to command the q2 joint velocity to be a positive number. For as long as the condition persists, it is plausible to believe that the q2 angle is increasing and that the energy of the system is increasing.

Swing-Up-Right (SUR)
  IC: qcm = π
  MC: inc(qcm)
  Body: dec(q̇cm); inc(qcm); Q+(Dur, qscm,i)

Swing-Up-Left (SUL)
  IC: qcm = π
  MC: dec(qcm)
  Body: inc(q̇cm); dec(qcm); Q+(Dur, qscm,i)

Swing-Down-Right (SDR)
  IC: q̇cm = 0
  MC: qcm > 0
  Body: dec(q̇cm); dec(qcm); Q+(Dur, qcm,i − π)

Swing-Down-Left (SDL)
  IC: q̇cm = 0
  MC: qcm < 0
  Body: inc(q̇cm); inc(qcm); Q+(Dur, π − qcm,i)

Lengthen-ccw
  IC: inc(q2) ∧ q2 < 0
  MC: same as IC
  Body: inc(lcm); dec(E); Q−(Ė, q̇2); Q−(Ė, qscm)

Shorten-ccw
  IC: inc(q2) ∧ q2 > 0
  MC: same as IC
  Body: dec(lcm); inc(E); Q+(Ė, q̇2); Q+(Ė, qscm)

Lengthen-cw
  IC: dec(q2) ∧ q2 > 0
  MC: same as IC
  Body: inc(lcm); dec(E); Q−(Ė, q̇2); Q−(Ė, qscm)

Shorten-cw
  IC: dec(q2) ∧ q2 < 0
  MC: same as IC
  Body: dec(lcm); inc(E); Q+(Ė, q̇2); Q+(Ė, qscm)

q2-ccw
  IC: control(q̇2 > 0)
  MC: same as IC
  Body: inc(q2); inc(E)

q2-cw
  IC: control(q̇2 < 0)
  MC: same as IC
  Body: dec(q2); inc(E)

R1: ?q = ?v ; ?i0 ⇐
  QC: dec(?q) ; ?i1 and ?q > ?v ; ?i2
  TC: overlaps(?i0, ?i1), overlaps(?i1, ?i2), before(?i2, ?i0)

R2: ?q = ?v ; ?i0 ⇐
  QC: inc(?q) ; ?i1 and ?q < ?v ; ?i2
  TC: overlaps(?i0, ?i1), overlaps(?i1, ?i2), before(?i2, ?i0)

R3: ?q > ?v ; ?i0 ⇐
  QC: inc(?q) ; ?i1 and ?q = ?v ; ?i2
  TC: overlaps(?i0, ?i1), overlaps(?i1, ?i2), before(?i2, ?i0)

R4: ?q < ?v ; ?i0 ⇐
  QC: dec(?q) ; ?i1 and ?q = ?v ; ?i2
  TC: overlaps(?i0, ?i1), overlaps(?i1, ?i2), before(?i2, ?i0)

R5 (inc definition): inc(?q) ≡ ?q̇ > 0

R6 (dec definition): dec(?q) ≡ ?q̇ < 0

R7 (angular speed definition): qscm ≡ |q̇cm|

Table 2: EBC's Qualitative Domain Theory for the ACROBOT
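
To make the table's structure concrete, a process can be represented as a record carrying its IC, MC, and body. The encoding below (over a simple state dictionary) is our own illustrative choice, showing two of the ten processes; it is not the paper's internal representation.

```python
from dataclasses import dataclass
from math import pi
from typing import Callable, Dict, List

State = Dict[str, float]  # e.g. {"q_cm": 3.1, "dq_cm": 0.4, "q2": -0.2, "dq2": 0.1}

@dataclass
class Process:
    name: str
    ic: Callable[[State], bool]   # initiation condition
    mc: Callable[[State], bool]   # maintenance condition
    body: List[str]               # statements licensed while the process is active

EPS = 1e-2  # tolerance for the equality test in the IC

SUR = Process("Swing-Up-Right",
              ic=lambda s: abs(s["q_cm"] - pi) < EPS,  # center of mass below shoulder
              mc=lambda s: s["dq_cm"] > 0,             # rotating counterclockwise
              body=["dec(dq_cm)", "inc(q_cm)", "Q+(Dur, qs_cm_initial)"])

LENGTHEN_CCW = Process("Lengthen-ccw",
                       ic=lambda s: s["dq2"] > 0 and s["q2"] < 0,
                       mc=lambda s: s["dq2"] > 0 and s["q2"] < 0,  # same as IC
                       body=["inc(l_cm)", "dec(E)", "Q-(dE/dt, dq2)", "Q-(dE/dt, qs_cm)"])
```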

There are four plausible mechanism-independent inference rules, R1-R4. Each has a quantity condition (QC) which must hold for the specified values and behavior of the world quantities and a temporal condition (TC) specifying requisite time relations among their intervals. When the conditions are met, the consequent may be plausibly believed. For example, R1 says that a quantity that is greater than some specified threshold value and is decreasing may reach that threshold.

A rule, such as R1, may at first seem superfluous or at least confusing. It says that a threshold value may be reached by a decreasing quantity, but allows also that the value might not be reached. Since the value in question is always either reached or not, what is the contribution of R1? After all, important thresholds such as process initiation conditions require that a quantity reach a threshold value; it is not sufficient to infer that it may reach the threshold.

Recall that the inferencer is invoked only to build explanations, and that explanations may only be constructed to account for observed behavior. An initiation condition cannot be inferred to hold by R1. However, once a condition is observed (or otherwise believed) to hold, R1 offers a suggestion of how this significant state of affairs might have been brought about. It focuses attention on whether and why the quantity might have been decreasing and why it was greater than the threshold value to begin with. If these in turn can be justified and traced ultimately to known control manipulations, then the system has successfully suggested an explanation for how the quantity was controlled to reach the threshold condition.

We wish to discuss briefly but more generally the notion of a plausible statement. The processes and plausible inference rules are represented as syntactically dressed up implications. However, we cannot interpret them as conventional implications. Consider the first statement in the body of q2-ccw, that q2 is increasing in value. It might seem to follow from the process's conditions, but in fact it does not. It is quite consistent to command the simulator to rotate q2 counterclockwise while observing q2 to rotate clockwise: the PD gains may be insufficient to achieve the desired angular velocity, or the torque required may be beyond what the motor can deliver, or the momentum of a previous fast clockwise rotation may not yet have dissipated, and so on. The actual conditions that entail a counterclockwise rotation, like the conditions of most other behaviors, are exceedingly difficult to know even for the simulator. They are often impossible to know for the real world.

What does it mean for a statement to be plausible? Such statements certainly cannot be given the standard interpretation of a constraint on possible worlds. For us a plausible statement has meaning only within the context of an explanation-based system, and only during the explanation phase. Plausible inference is allowed for the sole purpose of justifying (directly or indirectly) a behavior that has been observed to occur in the world. Thus, q2-ccw cannot sanction the inference of inc(q2) based on control commands alone, but if inc(q2) is first observed in the world during some interval, plausible inference may be used to conjecture why it occurred. In particular, q2-ccw may be used to explain the observation during any subinterval in which q̇2 is commanded to be positive. If, in the example being explained, the control input q̇2 is not commanded to be positive, q2-ccw cannot offer any plausible account for the observed inc(q2) and an alternative causal justification must be found. Of course, any plausible inference may be wrong for the world situation even though it is sanctioned. It is possible that the control input q̇2 is the sole significant cause of the increasing q2 angle. But the command may be only a contributing factor among several significant influences on q2 which together result in its increasing. It is also possible that the control input contributes only insignificantly to the q2 increase; its small contribution may be swamped by other much larger influences in the system.

Different explanations typically yield different generalizations. The set of all plausible explanations for an observation defines, via EBL, the set of all generalized causal accounts of the behavior. This is the hypothesis set of qualitative control strategies. Each distinct hypothesis suggests which facets of the system are significant for the phenomenon. All others are dismissed, assumed to be insignificant. The explanation conjectures how the significant quantities are to be qualitatively manipulated to achieve the qualitative outcome.

This space of hypotheses is empirically searched. A selected qualitative hypothesis is calibrated using observed quantitative data into a quantitative control strategy. Additional world goals are attempted with this strategy. World behavior consistent with its qualitative predictions is interpreted as evidence that the conjectured causal analysis models all of the significant factors of the world well enough to allow control. Conversely, inconsistent behavior is evidence against the hypothesis. If world evidence rejects the current hypothesis, another explanation is generated. As described earlier this is the simplest explanation consistent with the observed data. A new qualitative hypothesis is constructed, calibrated and tested.

The EBC Solution

The EBC solution to the problem of Figure 3 is shown in Figure 4. It is shown on the same scale as Figure 3. Because of the more efficient injection of energy, the total time is significantly less than for the HC solution. Note also that some control actions are more extreme than in the HC solution.

There are five steps to the automatic construction of an EBC strategy: observe, explain, generalize, optimize, and quantify. We now describe the step-by-step construction of the EBC strategy beginning with observing the training example. To better exhibit aspects of the numeric learning, the simulator's friction parameter is set to the 86% level. The friction parameter is discussed in more detail in the experimental section. For now it suffices to know that this is a relatively high value resulting in a fairly difficult control problem.

In the first step, as in many explanation-based approaches, we require at least one positive example, which is Figure 3. Such traces consist of a large number of sampled points. Each point is a vector of instantaneous values of "observable" quantities at the sampled instant in time. To be "observable," a quantity's value must be knowable to the system. In the simulator these are just the current values of computation variables. For a physical mechanism these correspond to quantities that can be directly observed through sensors (as q2 and q̇1 could be) or that can be efficiently computed from sensor values (like the energy, E). The HC solution of Figure 3 consists of 1,300 sampled points. EBC verifies that the goal is achieved. From inspection of the HC trace it is seen that qcm does indeed swing over the top at time tick 1241.

The second step is to construct an explanation which plausibly concludes that the total energy of the system is greater at time 1241 than it was at time 0. It does not attempt to justify that the energy is quantitatively sufficient to swing the ACROBOT over. The explanation must be rooted in the initial configuration of the ACROBOT at time 0 and the observed control manipulations of q̇2 over the interval [0, 1241]. Furthermore, all of the observable subgoal conclusions must themselves be derivable and not directly inconsistent with observation. It is not plausible, for example, for the explanation to require the quantity q̇cm to be increasing over an interval in which its observed behavior supports only the label of decreasing.

Figure 5a illustrates the intervals in which each monitored quantity can be considered to be increasing or decreasing. An increasing interval for a quantity is the maximal interval in which the quantity exhibits an increase somewhere in the interval of at least a minimum amount ε while not decreasing by more than ε anywhere in the interval. The system is not very sensitive to the choice of ε, although the explanation procedure fails if ε is small compared to the noise in the captured signals. A decreasing interval is similarly defined. Increasing intervals for a quantity are represented in Figure 5a by a dashed line above the quantity's trace; decreasing intervals by a dashed line below it. Increasing and decreasing intervals may overlap. At these times the inferencer may plausibly consider the quantity to be either increasing or decreasing. Within an explanation the inferencer is not allowed to simultaneously assign multiple inconsistent labels, however. Likewise the inferencer is free to ignore as insignificant any qualitative behavior not needed in the explanation.
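
The interval labeling can be sketched directly from this definition. The following simplified Python routine partitions a sampled trace greedily; the actual implementation computes maximal intervals, which may overlap, so the sketch is illustrative only.

```python
def increasing_intervals(x, eps):
    """Index intervals [i, j] in which x rises by at least eps somewhere
    while never dropping by more than eps from a running maximum.
    (Decreasing intervals: call with the negated signal.)"""
    intervals, i, n = [], 0, len(x)
    while i < n:
        peak = low = x[i]
        max_rise, j = 0.0, i
        while j + 1 < n and x[j + 1] >= peak - eps:  # no drop of more than eps
            j += 1
            if x[j] > peak:
                max_rise = max(max_rise, x[j] - low)  # rise from an earlier low
                peak = x[j]
            low = min(low, x[j])
        if max_rise >= eps:
            intervals.append((i, j))
        i = j + 1
    return intervals
```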

There are 125 increasing/decreasing temporal intervals supported in the HC solution. These are computed from the trace after the goal is verified. In addition to increasing/decreasing subgoals, the inferencer makes use of several binary qualitative relations, notably greater-than and less-than. The confirmation of these plausible subgoals are provided on demand by inspecting the quantitative input trace. Any required process must be plausibly active. This is justified by plausibly deriving the initiation and maintenance conditions over the required intervals. Figure 5b illustrates the intervals over which the eight generic processes of Table 2 can be instantiated during the HC solution. An instantiated process offers a method of causally justifying the statements in its body over the instantiation interval; the body is justified if the process's preconditions can be plausibly justified. There are 35 separate instantiations of the EBC processes. Collectively, these offer 108 temporal qualitative axioms which are conditionally justified based on their respective process preconditions.

Figure 5: Intervals for the HC Solution Explanation (a: trace with qualitative annotations; b: intervals of process activation)

The final explanation consists of a sequence of interleaved intervals of increasing and decreasing energy with the amount of energy gain in the increasing ones plausibly greater than the energy lost in the decreasing ones. The full explanation is quite large and tedious. However, it is instructive to examine two important portions. The first portion we examine is shown in Figure 6. For simplicity, no temporal interval information is shown although temporal information is present and crucial to the plausible qualitative explanation. This portion accounts for the plausibility of 14 distinct intervals alternating between increasing and decreasing energy. Note that the explanation does not account for all of the observed energy changes illustrated in Figure 5a. The other qualitative changes are implicitly conjectured to be insignificant for the purposes of control. If effective control cannot be established through this analysis then the explanation will ultimately be rejected. In fact, however, this explanation leads to the EBC control strategy whose trace is illustrated in Figure 4.

Figure 6: Plausible Explanation of HC's Increasing and Decreasing Energy

The increasing energy intervals are attributed to the Shortening processes. The initiation and maintenance conditions of these processes are subgoals which are themselves plausibly justified. The pattern of Figure 6 is repeated through the HC solution. At this point in processing it is not recognized as a repeated pattern, however.

Figure 7 shows another important portion of the explanation. It accounts for the qualitative oscillation of the ACROBOT's center of mass as a lengthy sequence of individual instances of SDL, SUR, SDR, SUL, SDL, SUR, etc. Again temporal interval information is omitted. This sub-explanation supports two plausible conclusions: 1) that the qcm movement is alternately increasing and decreasing as the system swings first counterclockwise and then clockwise, and 2) that the speed of rotation qscm grows to a maximum at the end of the swing-down intervals and slows to zero at the end of the swing-up intervals.

Figure 7: Plausible Explanation of HC's Oscillation

The explanations of Figures 6 and 7 are cross linked. This interaction provides a plausible account of why the energy of the system increases on each swing cycle. It is observed that less energy is lost than gained in the Lengthen/Shorten pair derived from the first q2 control. The same is true for the second and all succeeding Lengthen/Shorten pairs from succeeding q2 manipulations. The plausible justification of this observation leads to the qualitative control strategy.

The links between the above explanations conjecture that the energy inequality is due to the timing among the processes: a q2 control manipulation commences at the beginning of each Swing-Down process. Each such q2 manipulation gives rise to a Lengthen followed by a Shorten process instance. The speed of qcm, qscm, is plausibly increasing during each Swing-Down. Thus, it is plausible that qscm, the angular speed, is less during the Lengthen instance than during the immediately following Shorten instance. Thus, the energy gain is plausibly greater than the energy loss, since the rate of change of energy is plausibly related to qscm through the Lengthen/Shorten body statements. This is independently derived for the Swing-Down-Left and Swing-Down-Right processes and for each Lengthen/Shorten pair derived from a single q2 control.

Figure 8: The Significant Qualitative State Transitions

The third step is to generalize the explanation. The major result of this step is to recognize the cyclic nature of the lengthy linear explanation constructed in the previous step. In general, loop detection is difficult. EBC affords an effective simplification that allows comparatively efficient loop detection, however. Figure 8 shows schematically the significant qualitative states through which the ACROBOT is believed to transition according to the plausible linear explanation. A significant qualitative state is defined to be the set of simultaneous qualitative processes that contribute statements from their bodies to the explanation. The sequence of such states is easily constructed from the explanation.

The repetition of a sequence of significant qualitative states is easily detected. The recursive structure of the linear explanation is not proven, as would be necessary to deduce a loop (Shavlik 1990). However, the results need only be plausible, and it is plausible to conjecture a loop from such a repetition. The conjectured repetition of qualitative states accounts for all of the significant qualitative features of the observed HC data. It is shown in Figure 9.
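
Detecting such a repetition reduces to finding a block of significant qualitative states that tiles the state sequence. A minimal sketch of this idea follows; the exact matching procedure used by the system may differ.

```python
def find_cycle(states):
    """Return (start, period) of the shortest repeated block that tiles the
    tail of a sequence of significant qualitative state labels, or None."""
    n = len(states)
    for start in range(n):
        tail = states[start:]
        for period in range(1, len(tail) // 2 + 1):
            block = tail[:period]
            # does the block repeat (at least twice) across the whole tail?
            if all(tail[k] == block[k % period] for k in range(len(tail))):
                return start, period
    return None

# e.g. find_cycle(["SDR+Leng-CW", "SDR+Short-CW", "SUL", "SDL+Leng-CCW",
#                  "SDL+Short-CCW", "SUR"] * 3) -> (0, 6)
```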

The fourth step is to qualitatively optimize the control strategy so as to enhance the features that lead to goal achievement. This task is also relatively easy given the explanation structure. The relevant justification is that energy is increasing over the cycle of Figure 9. There are two components: the energy loss is to be minimized while the energy gain is maximized. Energy loss stems from the lengthening of the ACROBOT while energy gain is attributed to shortening. According to the explanation, energy loss is minimized if the action takes place when the angular speed, qscm, is slowest. The angular speed is zero at the transition from swing-up to swing-down in both the left and right parts of the cycle. Thus, the q2 command action for the two Lengthen intervals is scheduled to be centered at the point where qscm is zero at the top of each portion of the swing. To maximize the energy gain, the Shorten intervals are to be compressed as small as possible and centered on the point of greatest angular velocity. From the qualitative explanation of Figure 7 this is the point at which qcm swings through the vertical as Swing-Down changes to Swing-Up. In addition, the q2 deflection angle, α, is chosen to be the maximum possible. These changes are illustrated in Figure 10.
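
The resulting schedule can be expressed in terms of the estimated durations of the Swing processes (the duration estimators themselves are constructed in the fifth step, below). A minimal sketch, with hypothetical arguments:

```python
def schedule_cycle(t_up_start, dur_up_hat, dur_down_hat, dt_lengthen, dt_shorten):
    """Center the Lengthen command on the swing-up/swing-down transition
    (where qs_cm = 0) and the Shorten command on the swing-down/swing-up
    transition (where qs_cm is greatest), using estimated swing durations."""
    t_top = t_up_start + dur_up_hat        # predicted top of the swing
    t_bottom = t_top + dur_down_hat        # predicted bottom of the swing
    return (t_top - dt_lengthen / 2.0,     # start time of the Lengthen action
            t_bottom - dt_shorten / 2.0)   # start time of the Shorten action
```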

The crucial change is to schedule the Lengthen and Shorten processes so as to maximize the desired effects. In the HC solution, these effects were linked; they shared a single q2 command action. Since there is no Lengthen/Shorten decomposition of the action, its causally distinct phases are not appreciated. It is only after the explanation is constructed that the elements and their causal relations become clear: the explanation conjectures that a single HC control action is not a single conceptual item. Rather, each is composed of two conceptually distinct actions that happen to be temporally adjacent and happen to use the same q2 command parameter.

Figure 9: Conjectured Qualitative Control Strategy Underlying HC

Optimization alters the qualitative relation among process instances. The transformed qualitative control strategy is shown in Figure 11. While the new qualitative strategy is causally identical to the original, it appears to be quite different: there are now four command actions within each cycle rather than two; during half of the commands (the ones achieving a shortening of the ACROBOT) q̇2 is specified to be qualitatively as large as possible; likewise, the q2 deflection angle is maximized. The new qualitative strategy is still within the structure of the generalized explanation; it is simply instantiated in a way to enhance energy increase.

Figure 10: Scheduling Actions to Optimize Energy Pumping

The transformation is based upon a number of plausible conjectures. At this point in the processing there is no guarantee that these changes will help, or that the altered strategy will work at all. The anticipated improvement is contingent on the explanation adequately characterizing the real world. As in all Plausible Explanation-Based Learning, the adequacy of the explanation must be empirically confirmed before the strategy's conjectures can be believed.

In the fifth and final step, the qualitative strategy is quantified into an executable control strategy and the strategy is exercised. The parameters for each control action must be given computable values. Each q2 control action is defined by three parameters: a terminal q2 deflection angle, a desired rotational velocity q̇2, and a time point to begin the process. The deflection angles are qualitatively prescribed to be maximal by the qualitative optimization process. These maximal values are directly supplied by the simulator. The rotational velocities for the q2 actions causing Shortening of the ACROBOT are also qualitatively prescribed to be maximal and are supplied by the simulator. The q2 actions causing ACROBOT Lengthening have no qualitative requirements and their quantitative values are copied from those used in the HC training strategy.

Figure 11: Qualitatively Optimized Control Strategy

Scheduling the actions is more interesting. In the HC strategy, the two q̇2 command actions commence at observable time points (when q̇1 crosses zero). Now there are four command actions and they must begin at a prescribed time before the termination of each Swing process. These time points are not observable. The end of the process and the start of the next process can be observed, of course, but by then the time to commence the action has already passed.

In such cases the system attempts to construct a quantitative function to estimate the quantities required to make the unobservable time point computable. In this case, it suffices to know the duration of each of the four Swing processes. A qualitative explanation is constructed for these durations. The durations of the two Swing-Up processes are conjectured to be qualitatively monotonic functions of the rotational speed, qscm, at the time the processes begin. The durations of the two Swing-Down processes are conjectured to be qualitatively monotonic functions of the center of mass deflection at the time the processes begin. In the current version of the system, estimation functions are realized by cubic polynomials which are empirically calibrated by a regression fit to observed data. Other families of functions have been examined and also work. Cubics offer an attractive compromise among flexibility, expressiveness, and efficiency of learning.

Next we examine in more detail the acquisition of one estimator function: the one that computes the duration of the Swing-Down-Left process from the observable deflection of the center of mass at the point Swing-Down-Left begins. Similar processing underlies acquisition of the other three functions. The SDL estimator is a cubic polynomial in one dimension. The independent variable is the π − qcm deflection at the time that q̇cm = 0 (which initiates the SDL process). The estimator itself is represented as a vector of four floating-point numbers, the cubic polynomial coefficients. The initial component values are set arbitrarily.
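
The acquisition of an estimator thus amounts to iterated least-squares fitting of a cubic to accumulated (deflection, duration) pairs. A minimal numpy sketch, with names of our own choosing:

```python
import numpy as np

class CubicEstimator:
    """Duration estimator: a cubic polynomial in one observable, refit by
    regression as measured swing durations accumulate across trials."""
    def __init__(self, rng=np.random.default_rng(0)):
        self.coeffs = rng.normal(size=4)   # arbitrary initial coefficients
        self.xs, self.ys = [], []          # accumulated training points

    def __call__(self, x):
        return float(np.polyval(self.coeffs, x))

    def add_and_refit(self, deflections, durations):
        """Add measured (pi - q_cm, duration) points and refit the cubic."""
        self.xs.extend(deflections)
        self.ys.extend(durations)
        if len(self.xs) >= 4:              # need at least 4 points for a cubic
            self.coeffs = np.polyfit(self.xs, self.ys, deg=3)
```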

The initial EBC control strategy (with the initially-arbitrary estimator functions) is applied to a sequence of random initial ACROBOT configurations. The results are captured, analyzed, and used to refine the strategy. Several representative traces of initial control attempts are shown in Figure 12. The variations are due in part to the arbitrary initial estimation function and in part to the different initial conditions of the ACROBOT. About half the initial examples resemble the one shown in Figure 12a. Here no significant pumping actions occur. This is generally because the estimator polynomial computes an action schedule that cannot be executed (generally because the process is not active at the scheduled time). Traces like Figure 12b are also common. Here the basic control actions are reasonable at the beginning of the trace but incorrect timing quickly causes the actions to enter a nonsystematic behavior which may or may not be followed by additional periods of systematic behavior. Infrequently the initial attempt resembles Figure 12c showing immediate and persistent nonsystematic behavior.

Figure 12: Three Types of Initial Behaviors

Training data for the estimator functions are extracted only from the portion of behavior that plausibly follows the qualitative control strategy shown in Figure 11. The entire behavior of Figure 12a plausibly follows the qualitative strategy. It gives rise to three and one half pendulum swings yielding four complete SDL and SUR process instances and three SDR and SUL instances. Their durations are measured from the captured trace as is the respective independent variable for each. Thus, seven training points are extracted. The behavior of Figure 12b yields just one data point for each function before the behavior departs from the desired qualitative strategy. Behaviors like the one shown in Figure 12c yield no useful training data for the estimation functions. The initial data is very noisy; the points are well away from the true durations of the processes under their respective conditions.

The measured durations are likely to be much better approximations to the desired values than the values calculated by the original arbitrary cubic polynomials. After the captured trace is analyzed, a new cubic regression function is fit to the accumulated empirical points. The strategy is then executed again on a new swing-up problem with the new estimator functions. This repeats, accumulating training data until the estimator functions converge. Convergence is the point at which, for the existing system noise, the estimator functions provide duration characterizations that are as accurate as those yielded by the actual trace measurements.

It might first appear that the measured durations are not approximations at all but are, in fact, objectively correct. After all they measure the desired process duration for the observed arm configuration. While this is true, we must remember that the full arm configuration is not monitored. Doing so would greatly increase the dimensionality of the estimator function. In turn this would require an unrealistic number of training examples. Instead, we should think of the arm configuration as summarized by the (in this case single) explicit independent variable: the initial deflection angle for the swing down processes and the initial velocity for the swing up ones. The qualitative explanation can be interpreted as stating that the explicit independent estimator variables are sufficient to predict the desired dependent values provided that the qualitative strategy is adequate. The ACROBOT is a driven double pendulum. Its period, the combined swing-up and swing-down durations, is in large part a function of the driving policy. The adjustments to the driving policy unavoidably add to the observed variance. Until the policy converges, the measured Swing durations are necessarily noisy approximations to the durations of interest.


Figure 13: SDL Duration as a Function of Initial π − qcm after 3, 5, 9, 13, 20, and 100 Trial Runs. a) 3 runs, 4 SDL durations, fit: undefined; b) 5 runs, 9 SDL durations, fit: -0.01; c) 9 runs, 17 SDL durations, fit: 0.465; d) 13 runs, 29 SDL durations, fit: 0.882; e) 20 runs, 50 SDL durations, fit: 0.943; f) 100 runs, 335 SDL durations, fit: 0.954.

Figure 14: ACROBOT Behavior for Learning Trial 20

Importantly, the estimator functions not only converge but do so quickly. Figure 13 shows the progressive refinement of the SDL estimation function to the observed training durations.

After 9 trials the approximation function changes relatively little, although the confidence, reflected in the goodness-of-fit, does not reach an acceptable level until trial 13. The other three estimation functions improve similarly, although the cubic polynomial family cannot supply quite as good a fit for the swing-up durations. In repeated learning sequences the goal is nearly always achieved after the 10th trial. Consistency and robustness of learning are discussed in more detail in the Empirical Results section, which follows. The behavior trace of trial 20 is shown in Figure 14. This is typical of the remaining trial runs.

Empirical Results

How good is the EBC solution? There are two other known swing-up control strategies. How does EBC compare with them? In this section we empirically compare EBC's behavior with the other strategies. The first, HC, has already been discussed. The second, called ATAN, removes the discontinuities that make HC's behavior difficult to analyze. ATAN chooses a command trajectory of q2 = k1 arctan(k2 q̇1). This strategy is not heuristic; a rigorous justification of its swing-up behavior can be derived. Details are given in (Spong 1994). In the following experiments the three strategies of EBC, HC, and ATAN are compared.
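For reference, the ATAN command law is simple enough to state directly. The sketch below is illustrative only: the gain defaults are placeholders, not the tuned values used in the experiments; see (Spong 1994) for the derivation.

import math

def atan_command(q1_dot, k1=1.0, k2=1.0):
    # The ATAN command trajectory: commanded elbow angle
    # q2 = k1 * arctan(k2 * q1_dot), where q1_dot is the shoulder
    # angular velocity. k1 and k2 are the strategy's two constant
    # parameters; the defaults here are placeholder values.
    return k1 * math.atan(k2 * q1_dot)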

Experiment 1: Learning Cost

How quickly does the EBC strategy learn to control the ACROBOT? After learning, how good are its solutions relative to the other known strategies of HC and ATAN? Friction is known to play a crucial role in making the swing-up problem a challenging one. Is one strategy superior to the others for all friction settings or are there areas of "expertise" limiting the range of dominance of one strategy over another?

To investigate the relative solution qualities, the simulator was run under control of each of the strategies at three different friction levels while collecting the time to swing-up in simulator time ticks. In no case was the ACROBOT continuing to make progress toward the swing-up goal at the end of 3000 ticks; either the goal had been achieved or the energy level of the ACROBOT had ceased any systematic increase. Thus, a swing-up time of 3000 in the following graphs indicates goal failure.

Each strategy's behavior is averaged over 20 runs. The initial arm configuration is selected randomly as follows: the arm is straight and motionless, q2 = q̇1 = q̇2 = 0, with an initial q1 deflection selected uniformly between 0.2 and 1.2 radians. This corresponds to an initial energy level of between 1% and 31% of the energy needed to swing up the arm.

Figure 15: EBC, HC, and ATAN Swing-up Time Ticks vs. Training Trial under (a) Low, (b) Medium, and (c) High Friction Conditions

For the non-learning systems (HC and ATAN), a run is a single swing-up problem. For EBC, each run is a sequence of 50 trial problems. To commence each run, the estimator functions are initialized with arbitrary real cubic coefficients.
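The experimental protocol can be sketched as follows. This is a minimal illustration under stated assumptions: the simulate hook, the dictionary representation of the configuration, and all names are ours; only the sampling range, the 3000-tick failure cap, and the 20-run averaging come from the text.

import numpy as np

TICK_LIMIT = 3000  # a swing-up time of 3000 ticks denotes goal failure

def sample_initial_configuration(rng):
    # Arm straight and motionless (q2 = q1_dot = q2_dot = 0), with the
    # initial q1 deflection drawn uniformly from [0.2, 1.2] radians.
    return {"q1": rng.uniform(0.2, 1.2), "q2": 0.0, "q1_dot": 0.0, "q2_dot": 0.0}

def average_swing_up_time(strategy, simulate, n_runs=20, seed=0):
    # simulate is a caller-supplied hook returning the tick at which
    # swing-up was achieved, or TICK_LIMIT if the goal was not reached.
    rng = np.random.default_rng(seed)
    times = [simulate(strategy, sample_initial_configuration(rng), TICK_LIMIT)
             for _ in range(n_runs)]
    return float(np.mean(times))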

Figure 15 presents the resulting data as learning curves for the EBC strategy along with swing-up times for the two non-learning strategies. The vertical bars on the EBC learning curve show the 95% confidence range for the swing-up times averaged across the 20 runs.

In all cases, the EBC strategy begins to achieve the goal after relatively few attempts. Furthermore, after learning and in all friction conditions, the EBC strategy is superior to the other two. The dominance of EBC increases as the problems become more difficult due to increased friction. This pattern was observed to hold at many other friction settings as well.

A learning system pays a price avoided by non-learning systems. Even if the resulting solutions are superior, the cost of learning may be too high to be worthwhile. Under what conditions is the added cost of learning adequately compensated by the improved behavior it engenders? In Figure 16 we present the data in a different manner. Here the cumulative swing-up time is computed over the 50 trial problems. These cumulative times are averaged over the 20 runs. The non-learning strategies produce straight lines whose slope is the average solution time over the 20 runs. For EBC, the initial curvature is high; the time to solve each problem is high but decreasing. The EBC curve flattens out as learning ceases to improve performance. Where the EBC line crosses the line of another strategy, EBC's learning penalty relative to that strategy is just recouped by the accumulated performance improvement. After crossing, the lower EBC curve indicates that savings continue to accumulate.
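The break-even point just described is easy to compute from the raw data. A minimal sketch, assuming per-trial swing-up times are available as arrays; the function name and arguments are ours:

import numpy as np

def break_even_trial(ebc_times, baseline_mean_time):
    # Given EBC's swing-up time on each trial problem of a run and a
    # non-learning strategy's average solution time, return the first trial
    # at which EBC's cumulative time falls below the baseline's straight
    # cumulative line (i.e., where the learning cost is recouped), or None.
    ebc_cum = np.cumsum(ebc_times)
    trials = np.arange(1, len(ebc_times) + 1)
    crossed = np.nonzero(ebc_cum < baseline_mean_time * trials)[0]
    return int(trials[crossed[0]]) if crossed.size else None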

From the different friction conditions it is clear that the benefit of learning increases with problem difficulty. In the Low Friction condition, the learning cost of EBC, while small, is never recouped within the trial window of 50 problems. In the Medium Friction condition the EBC learning cost is recouped after 29 trials compared to the weaker ATAN strategy and after 34 trials when compared to the HC strategy. As the friction increases, EBC's learning consumes more trials for practice but also produces relatively better swing-up performance. In the High Friction condition EBC dominates ATAN, which fails to achieve the swing-up goal at all. Compared to HC, the cost of EBC's learning is recouped after 20 trials.

Experiment 1 indicates that learning is relatively quick and that, after learning, the EBC strategy uniformly dominates the other two. However, in the low friction condition the advantage is marginal and learning may not be justified. Thus, for low friction problems any strategy suffices, and the non-learning ones offer the advantage of not paying the initial cost of learning. The story is quite different in the high friction case. The non-heuristic strategy, ATAN, fails to achieve the goal in any of its 20 attempts. The EBC strategy fails only during the first few attempts, when it is learning the fastest. The HC strategy can achieve the goal but much less efficiently than EBC. For high friction conditions the EBC strategy provides clearly superior performance.

Experiment 2: Performance on Increasingly Difficult Problems

Experiment 2 further explores the trend of solution quality as a function of problem difficulty uncovered in Experiment 1. We compare resource utilization of the different strategies in solving gradually more difficult problems. There are three important resources consumed by any strategy. First is the time needed to achieve the goal. A strategy that more quickly imparts the requisite energy to the ACROBOT can be preferable to one that takes longer. A second resource is the total energy needed by the strategy to achieve the goal. A third and final resource is power.

Figure 16: EBC, HC, and ATAN Cumulative Swing-up Time Ticks Over a Run of 50 Trial Problems, under (a) Low, (b) Medium, and (c) High Friction Conditions

This is quite different from energy efficiency. Power is the time rate of change of energy. Thus, a more power-efficient strategy is one that can successfully achieve a goal with a smaller motor, even if it perhaps takes more time and more total energy in achieving the goal.

For this experiment, the EBC strategy was trained on 100 random problems and then frozen. The simulator was controlled by each of the three strategies in turn at gradually increasing levels of friction. The system's friction parameter was varied from 0.0 to 1.0.

Figure 17: EBC, HC, and ATAN Time to Swing-up in Ticks by Friction (see text)

We must pause a moment to discuss the measure of friction. Frictional loss in the ACROBOT system is through the two joints. Friction is modeled as a fractional loss of velocity at each simulation time tick; as rotation increases, a joint loses linearly more of its velocity. The same friction parameter is applied at both joints. When set to 0.0, no velocity and therefore no energy is lost to friction. With the friction parameter set to 1.0, frictional loss is extreme: at this setting, when the ACROBOT is released from a deflection of 1 radian with joint 2 locked (single pendulum mode), only 3% of the energy remains after one swing cycle. The frictional loss through joint 1 consumes 97% of the system's energy. At this setting, no swing-up algorithm was able to achieve the goal.
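The friction model can be sketched as a per-tick velocity decay. The sketch below is an assumption-laden illustration: the paper fixes the model only by its aggregate effect (97% energy loss over one swing at friction = 1.0), so the per-tick scale factor here is our placeholder, not a published constant.

def apply_joint_friction(q1_dot, q2_dot, friction, per_tick_scale=0.01):
    # At every simulation tick each joint loses a fraction of its angular
    # velocity proportional to the single friction parameter in [0, 1].
    # per_tick_scale is an assumed calibration constant; the simulator's
    # actual value is not given in the text.
    decay = 1.0 - friction * per_tick_scale
    return q1_dot * decay, q2_dot * decay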

At each friction level the performance of each strategy is averaged over 100 swing-up problems. The initial configurations are drawn according to the random distribution employed in Experiment 1. Trials were ended if swing-up was not achieved after 4000 time ticks.

The data is analyzed first for the time efficiency of the strategies. The results are shown in Figure 17. The efficiency ordering is insignificant until the friction parameter reaches 0.5. At this point the time efficiency of ATAN begins to decay quickly. By a friction level of 0.54 ATAN is no longer able to achieve swing-up. Shortly thereafter HC's performance begins to degrade. HC fails to achieve any swing-up goals above the 0.76 friction level. EBC continues to solve swing-up problems until the 0.94 level. This clearly indicates that EBC is the most robust strategy of the three with respect to frictional dissipation.

At some friction level each strategy must fail to achieve the goal. It is clear that once trained, EBC provides the superior time efficiency as well as the greatest tolerance of frictional losses. It is interesting to note that the performance degradation is precipitous for all strategies.

The second question investigated in this experiment is energy efficiency. Energy is not uniformly added to the ACROBOT by any of the strategies. Actions sometimes add and other times remove energy. Additionally, frictional losses bleed energy at a rate dependent on the instantaneous ACROBOT state whose trajectory varies widely among the strategies. One measure that can be used to compare energy efficiency is the sum total energy added to the ACROBOT while pursuing a swing-up goal. The less energy that a strategy needs to add in achieving a swing-up goal, the higher its energy efficiency.

Figure 18 shows the cumulative energy added to the ACROBOT by each strategy as a function of frictional loss. For easier problems (below a friction level of 0.5) the total energy consumed is consistently lowest for ATAN with EBC highest. As each strategy nears its friction limit, energy consumption jumps. The energy consumption after this jump is of little concern since none of those swing-up problems are achieved; the ACROBOT is

essentially thrashing. At low friction, EBC is less energy efficient than the others. Its better behavior on difficult problems seems to be a direct consequent of EBC's robustness to friction and the precipitous failure mode exhibited by all of the strategies. It is interesting to note that the energy required reaches a kind of plateau at the point where the strategy begins to fail to achieve the swing-up goal. Furthermore, the height of the plateau is in the same order as the total energy efficiency for solvable problems

0

1000

2000

3000

4000

5000

6000

7000

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 1

Figure 18: EBC , HC , and ATAN

Cumulative Trial Energy by Friction

0

5

10

15

20

25

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 1

Figure 19: EBC , HC , and ATAN

Power Use by Friction

29and the reverse order of the strategy's robustness. We will return to this point in Experiment 3.

Next we examine the maximum power use of the three strategies. The maximum power is measured as the largest energy change between any two successive clock ticks in the course of attempting a swing-up goal. It is roughly an indication of the minimum size of the motor required to execute the strategy. Alternatively, it can be viewed as the highest instantaneous electrical current required to drive a sufficiently large motor to faithfully follow the strategy.
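This measure is straightforward to compute from a logged energy trace. A minimal sketch, assuming the ACROBOT's total mechanical energy is sampled once per tick; the function name is ours:

import numpy as np

def max_power(energy_trace):
    # Maximum power as defined above: the largest energy change between
    # any two successive clock ticks during a swing-up attempt.
    return float(np.max(np.abs(np.diff(np.asarray(energy_trace, dtype=float)))))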

Figure 19 shows that the maximum power use is nearly independent of whether the goal was achieved or not. It is primarily a function of the swing-up strategy and the friction level. Furthermore, the EBC strategy requires significantly higher power than either the HC strategy or the ATAN strategy.

In total, Experiment 2 indicates that EBC is preferred if time efficiency is important or if frictional losses are high. However, if low power consumption is crucial and at the same time friction losses are small then ATAN is the strategy of choice.

Experiment 3: The Effect of Power on Performance

Figures 18 and 19 suggest a possible alternative interpretation for EBC's good performance. It is possible that EBC's success is not due to a qualitatively more sophisticated strategy: EBC's higher power utilization may be at the heart of its superior behavior. Perhaps all strategies are approximately equal in pumping effectiveness and EBC is successful simply because of its relatively more energetic motions which consume more power and consequently can pass energy more quickly to the ACROBOT. Then it is not the timing or separation of the q2-actions that makes EBC successful, but its raw power consumption.

If this is the case, then HC and ATAN should enjoy similar time efficiency and robustness against frictional losses if their actions were made as energetically as EBC's. This is a testable hypothesis, which leads us to Experiment 3. Design of this experiment requires some subtlety, however. An obvious method to examine this question might be to systematically alter the HC and ATAN strategies so that the same actions are performed but sufficiently more quickly that their maximum power consumption is at the level of EBC's. This, however, may not be a fair test. The ATAN and HC strategies would be significantly altered while the EBC strategy would be left entirely intact. Such a comparison might naturally favor the intact EBC strategy over the modified ATAN and HC ones.

For Experiment 3, all three strategies' actions were modified similarly. Each strategy was replaced by one that performs the same moves but sufficiently more quickly so as to use 50% more maximum power.

Evidence for this new interpretation would be a systematic improvement in all strategies over their behavior in Figure 17. To confirm this hypothesis it is not necessary for either ATAN or HC to outperform EBC; it is sufficient that they outperform their unenhanced behavior. On the other hand, if a strategy fails to improve its performance, it is thereby demonstrated not to be merely power-limited.

Note that none of the three strategies is re-tuned. ATAN has two constant parameters, HC has three constant parameters, and EBC has three constant and four functional parameters. None of these are altered. With the most parameters interacting to yield a behavior, the lack of re-tuning is perhaps most unfair to EBC. However, this lack of re-tuning is consistent with the hypothesis, which suggests that raw power delivery (and not its careful application) is important for robust swing-up behavior.

The results of the investigation are displayed in Figure 20. The EBC strategy (a) is neither significantly helped nor hurt by the additional power. The HC strategy (b) is dramatically worse, while ATAN (c) suffers a small but significant performance penalty.

Figure 20: Swing-Up Time vs. Friction under Normal and 50% Higher Power Levels; (a) EBC, (b) HC, (c) ATAN

Clearly none of the strategies is simply power-limited. The data are inconsistent with the interpretation that higher maximum power, rather than its judicious use, is behind EBC's success. Similar though proportionately smaller effects are found for power increases of less than 50%. At increases greater than 50% the effects are magnified, with EBC also soon showing degraded performance.

It is interesting that the effectiveness of the HC and ATAN strategies (and of EBC as well at higher additional power) seems to be related inversely to maximum power use. This suggests yet another interpretation. Perhaps the strategies are somehow on the negative side of some power curve, and by decreasing the maximum power the pumping effectiveness can be improved. While this interpretation seems less likely a priori, the data of Figure 20 suggest it as an intriguing possibility which must be refuted if our interpretation of EBC's success is to stand.

To address this question, the experiment was re-run but this time altering the strategies to use 50% less maximum power. The results are shown in Figure 21.

Figure 21: Swing-Up Time vs. Friction under Normal and 50% Lower Power Levels; (a) EBC, (b) HC, (c) ATAN

Using only half of their original power, all of the strategies perform significantly worse. Again, intermediate reductions in power yield less pronounced performance degradation. One possible interpretation of Figures 20 and 21 is that HC and ATAN are already nearly optimized for power use. Swing-up requires significant excursions into the nonlinear portions of the system's state space, and it is quite possible that power is best utilized in different ways across these nonlinearities. HC and ATAN were hand-designed and hand-optimized through trial and error. They may embody close-to-optimal tradeoffs throughout the entire swing-up solution. Adding power uniformly may well help in one portion of the solution while harming effectiveness in another.

Interestingly, in this experiment the HC strategy shows a qualitatively different sort of failure, which is in fact displayed by the other strategies as power is decreased below the 50% level. Upon further investigation, this different degradation mode can be traced to the power limitation. With inadequate power, the difficulty of a problem (and a strategy's success on that problem) begins to be significantly influenced by the problem's initial conditions. Starting the task with a small initial deflection (1% of the swing-up energy) or a large initial deflection (31%) dictates whether a strategy can solve the problem. With inadequate power at its disposal, the goal can still be achieved for large initial deflections but not for small ones. As friction is increased, the initial energy required for successful swing-up also increases, giving rise to a systematic (though high-variance) trend toward poorer performance.

In sum, Experiment 3 shows that the timing and configuration interactions for the current HC and ATAN parameter values achieve near-optimal behavior for the respective strategies. The EBC strategy, on the other hand, sets its own parameters (the q2 deflection angle α, q̇2, and the action scheduling times). An interesting and significant empirical result of Experiment 3 is that EBC's power-determining parameter, q̇2, can be safely dictated by other considerations (e.g., power consumption, motor cost, and so on). EBC's successful behavior is relatively robust so long as available power does not fall below some minimal level.

Relation to Standard Control Theory Approaches

Control theory is a sophisticated, mature, and diverse discipline. We cannot adequately compare EBC to all of its many areas. However, it is important to understand why the more common approaches cannot easily solve the ACROBOT swing-up problem. To this end we briefly outline the features of adaptive control, robust control, learning control, repetitive control, optimal control, reinforcement learning, fuzzy control, and heterogeneous control that distinguish them from explanation-based control, particularly in the context of the ACROBOT swing-up problem.

To begin, a question naturally arises within AI as to whether so simple a system as a double pendulum can manifest behavior beyond the scope of standard control theory. In fact this is possible precisely because of control theory's maturity. The strengths and weaknesses of the major control theory approaches are comparatively well understood. This understanding gives rise to the identification of classical problems which capture the essence of a difficulty while avoiding the overhead of extraneous complexities. Small changes in these problems encounter performance cliffs. Controlled balancing of the underactuated double pendulum at the unstable equilibrium point is easy, while controlled swing-up of the same double pendulum is extraordinarily difficult. The acknowledged existence of such performance cliffs is directly due to the maturity of control theory as a field. It is no coincidence that the first control theory solution to the ACROBOT swing-up problem was the ad hoc heuristic control algorithm that we refer to as HC. The HC control scheme is not a product of any standard control theory paradigm. We now turn to a number of the more common paradigms and for each argue why swing-up is a poor problem match.

First, consider robust and adaptive control (Astrom and Wittenmark 1989; Narendra and Annaswamy 1989; Dorato and Yedavalli 1990). While these embody very different techniques, their goals are similar from the perspective of the swing-up task. Both are concerned with effective control in the face of less than perfect knowledge of the target system. Robust control techniques aim at effectively controlling a real system even though the differential equation model upon which control is based is itself only an approximation or an idealization. While one can often prove adequate control of the idealized system, robust control techniques extend the range of adequate behavior to different but, in some precise sense, nearby systems which, one hopes, include the real-world system. In the appropriate circumstances one can prove strong results guaranteeing good target system control in spite of this model mismatch.

Adaptive control is motivated by somewhat different concerns. Here, the form of the differential equations is known but some parameters are uncertain. The real-world system is stipulated by assumption to be some member of the parameterized family defined by the form of the equations. Adaptive control specifies conditions under which initial interactions with the real-world system result in effective estimation of these unknown but crucial parameters. Estimation must be sufficiently fast to allow adequate control of the target system. Thus, both robust and adaptive control provide techniques to remedy the difficulty of not knowing a precise differential equation model of the system to be controlled. However, the ACROBOT swing-up problem is not a model identification problem. We can assume the equations of Table 1 perfectly describe the double pendulum of interest. Then there are no parameters to estimate, nor is the system to be controlled different from the idealized model. Yet the swing-up task remains difficult. It should be clear that the swing-up problem simply falls outside the realm of expertise of adaptive and robust control.

At first blush, the field of learning control might seem more appropriate. Learning control is closely related to another paradigm known as repetitive control (Arimoto 1993; Farrell and Baker 1993; Spong 1996). In repetitive control the target system is intended to follow some precisely repeating cyclic pattern. As the pattern repeats, errors are observed, and information about the control action deficiencies on this cycle is gained. This information can be used to alter the selected actions so that on the next cycle the errors will diminish. Learning control is conceptually similar except that, instead of a cyclic pattern, the system is reinitialized to the beginning of the desired trajectory. This situation occurs, for example, in industrial robotic pick-and-place operations. In both learning and repetitive control the target system is to follow a particular and unchanging reference trajectory a great many times to guarantee convergence. Such precise repetition is not possible in the swing-up task. While generally cyclic, the pattern must be changed steadily and in a nonlinear fashion with each swing. Both minute amounts of noise and slight variations in the starting configuration require large adjustments later in the swing-up procedure. Learning control would be appropriate only if the ACROBOT system were constrained to start in precisely the same initial state on each trial and if the system were constrained to exhibit no noise. Without both of these conditions, learning control is not applicable; with these conditions, the swing-up problem is made artificially easy.

Next, consider optimal control. One way of thinking about optimal control is that the controller itself, rather than the human user, selects the trajectory. The human specifies some desired state and the controller selects the best way to bring it about. Usually, "best" means optimal using some measure of control effort, although other objective functions can also be adopted. The most common approach is known as the LQG (for Linear-Quadratic-Gaussian) model, in which the system is linear, the objective function to be minimized is quadratic, and the noise is Gaussian. The Gaussian noise assumption ensures that any antagonistic behavior sufficiently systematic to disrupt control is extremely unlikely. The quadratic penalty function ensures that positive and negative control forces do not cancel each other. Control effort is squared in the penalty function. This has the advantage over simply taking the absolute value because a quadratic function is easily optimized analytically: it is everywhere differentiable, its derivative is easily specified in closed form, and it has no local optima other than the global minimum. Unfortunately the system to be controlled must be linear. The ACROBOT system is not linear and, as discussed earlier, the swing-up problem requires the system to be driven far into the nonlinear range, so that the system cannot be effectively linearized. The LQG approach is not well matched to the swing-up problem.
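For concreteness, the LQG formulation referred to here has the standard textbook form (stated for reference; this is not a formula from the present work):

\dot{x} = A\,x + B\,u + w, \qquad J = \mathbb{E}\left[\int_0^{\infty} \left( x^{\top} Q\, x + u^{\top} R\, u \right) dt \right],

where w is Gaussian white noise, Q is positive semidefinite, and R is positive definite; the u^T R u term is the squared control effort discussed above.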

A very different, but also well-known, form of optimal control is dynamic programming (Bellman 1957; Bertsekas 1987). It does not make the assumption of system linearity, and it naturally applies to tasks, such as the ACROBOT swing-up problem, that require a sequence of diverse actions to be applied while the system changes state. Perhaps because of these last two attributes, dynamic programming has received a good deal of attention from the AI community. As early as 1968 (Michie and Chambers 1968) researchers exploited ideas from dynamic programming to guide on-line decision making. More recently, dynamic programming has significantly influenced the machine learning community as the conceptual foundation of reinforcement learning (Sutton 1992; Kaelbling, Littman et al. 1996). There are two drawbacks to these techniques. First, they require discrete states and actions. Many actions (like applying a torque at joint 2) and state variables (like the angular velocity at joint 1) are naturally represented as real values. However, reinforcement learning requires propositional (i.e., zeroth order) representations to limit consideration to a finite number of actions and states. If the implementor's prior discretization is too coarse, no solution may be possible. If it is too fine, excessive learning time is wasted in redundant work. A second drawback is the example complexity for reinforcement learning. The number of states is exponential in the number of state variables, and proofs of convergence for incremental dynamic programming require each state to be visited infinitely often (Whitehead 1991; Watkins and Dayan 1992).
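The exponential blow-up behind the first drawback is easy to make concrete. A one-line illustration (the function and its defaults are ours; the ACROBOT's four state variables are q1, q2 and their angular velocities):

def discretized_state_count(bins_per_variable, n_state_variables=4):
    # A grid discretization of a continuous state space grows exponentially
    # in the number of state variables: even a modest 10 bins per variable
    # over the ACROBOT's four yields 10**4 = 10,000 discrete states.
    return bins_per_variable ** n_state_variables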

Nonetheless, reinforcement learning has been successfully applied to the task of ACROBOT swing-up (Sutton 1996). We view this as a significant accomplishment since the reinforcement learning approach avoids the additional inferential mechanism that accompanies EBC's declarative prior knowledge. Sutton employed a sparse/coarse coding approach known as CMACS to help solve the dimensionality problem. Other assumptions included no friction, no noise, and a unique repeated starting configuration. Initial investigations indicate EBC to be more robust as the assumption of a unique starting state is weakened. Relaxing the no-friction and no-noise assumptions appears to give rise to a more interesting set of tradeoffs. A full comparison is the subject of a current investigation.

The EBC solution derives its advantage from effectively blending prior domain knowledge with empirical observations. The prior knowledge is symbolic and qualitative and need only be a plausible characterization of the domain elements. In this regard EBC shares some features with fuzzy control (Berenji 1990; Driankov, Hellendoorn et al. 1993) which is based upon fuzzy logic (Zadeh 1987). However, the underlying approach is quite different. Fuzzy control typically requires the explicit specification of the control strategy using linguistic variables (Zadeh 1975). In EBC predicates are crisp but qualitative; they do not correspond to degrees of membership. Furthermore, there are no fuzzification/defuzzification steps, the control strategy is constructed through plausible inference, and symbolic explanation-based learning is a central (even defining) characteristic of the approach.

Related to fuzzy control is heterogeneous control, which blends conventional local controllers into a global nonlinear control strategy. Kuipers and Astrom (Kuipers and Astrom 1994) have advanced an approach to validating some classes of heterogeneous controllers. In particular, the approach involves constructing a modal proof over the set of possible qualitative behaviors as derived by QSIM (Kuipers 1986). The modal nature of the proof dictates that the qualitative description of the combined system/controller be sufficiently accurate that no unstable, impossible behaviors are included. This is not currently the case for the ACROBOT with either the HC or EBC controllers. Furthermore, the swing-up task would require an inductive modal proof, which can be very difficult (Zhang 1996). Nonetheless, their approach points in an interesting direction which we are currently considering.


Conclusions

We have shown that explanation-based control can generate a control strategy competitive with the best existing strategies from control theory. Control of the ACROBOT mechanism is difficult because the solution requires repeated actions whose characteristics change with state. Most solution trajectories reach well into a nonlinear region of the system's phase space, so no linearization is possible.

EBC's solution to the ACROBOT swing-up problem illustrates the power inherent in the ability to make explanatory distinctions that have no directly observable indicator. EBC interprets each HC control movement as two causally distinct actions accomplishing two distinct subgoals. Once separated, these actions can be scheduled and optimized individually. This separate treatment is at the heart of EBC's successful behavior for the ACROBOT swing-up problem.

From a different point of view, EBC's success can be seen as due to the new dimensions it introduces into the control space. Each original HC swing cycle is determined by three parameters: a beginning time, an ending time, and a joint velocity for its single control movement. The EBC swing cycle contains two actions and, therefore, twice as many control parameters. These afford additional flexibility in the control space and as it happens this new portion of the space contains a more efficient strategy. From this perspective we can also interpret the role played by the plausible background theory. Adding dimensions to the control space must be done judiciously since increasing the dimensionality of the hypothesis space can make finding a good strategy dramatically more difficult. The space of control strategies should be increased only if no adequate element is likely to exist in the original space. In EBC a new explanation is constructed (yielding a revised space of controllers) only when training on the original space has proved fruitless. Furthermore, EBC is cautious; the revised space corresponds to an explanation which qualitatively accounts for the previous deficiencies.

A major motivation of this work is the real-world importance of continuous and overlapping world change. Several prior systems have explored this problem from the perspective of reasoning about differential equations (Sandewall 1989; Dean and Siegle 1990). Like conventional control theory, these approaches are susceptible to the continuous qualification problem discussed earlier; their differential equation model can only approximate the behavior of the mechanism to be controlled.

Penberthy (Penberthy 1993) introduces piecewise linear constraints to augment a nonlinear planner. Hogge (Hogge 1987) and Drabble (Drabble 1993) have explored planning at the qualitative level without descending to quantitative manipulation. Although not a full planner, (Wellman 1988) is also of this variety. Others (e.g., (Sacks 1990)) have employed qualitative reasoning about dynamical systems for the purpose of describing a system's behavior rather than automatically controlling it. None of these, however, includes a learning component, which is central to the EBC approach.

There has also been work along other directions on intelligent control of continuous domains. Reinforcement learning (Sutton 1992; Kaelbling, Littman et al. 1996), connectionist control systems (Barto 1990), and memory-based techniques such as (Atkeson, Moore et al. to appear) are like EBC in that they employ learning from experience to help define a control strategy. However, unlike EBC, these do not employ symbolic representations. EBC is likely to be preferable in domains that are complex but only partially understood. A plausible declarative domain theory and the ability to inferentially combine its components in novel ways can be important in such domains. On the other hand, an inferential capability is not helpful in completely novel domains where no prior expert is available, or in domains of relatively low dimensionality or whose relevant interactions can be accurately and quantitatively characterized. For such applications, EBC is likely to perform less well than these alternatives.

A case-based approach to acting in continuous domains is offered in (Ram and Sanamaria 1997). The domain is robot path planning. A set of parameterized reactive motor schemata are given to the system, which learns how best to adapt the schema parameters to elicit both simple and complex emergent behaviors. Unlike EBC, inferencing over background knowledge does not play a central role in case retrieval or case adaptation. However, such inferencing is not uncommon in other case-based systems (Kolodner 1993) and might conceivably be added to this one. A particular strength of their research is the use of a real-world robot for some of the investigations. The approach is an exciting one, but the path-planning problems, though far from trivial, are hardly challenging for state-of-the-art non-learning path planners. The process rather than the results is of primary interest.

SPLICE (Rogers 1997) employs the SOAR architecture to learn effective behavior in several continuous domains. The domains include speed control of a simulated automobile and speed and altitude control of a flight simulator. The latter domain is quite complex, and some fascinating successes and failures are discussed. The system is similar in spirit to EBC in its melding of symbolic inference to hypothesize control effectiveness with numerical interpolation to support quantitative control decisions. The most important difference is in the use of symbolic reasoning. In EBC, symbolic reasoning is performed only during the refinement of a concept. In SPLICE, symbolic problem solving occurs with concept use as well. This stems from the SOAR methodology, in which learning is the chunking of search control knowledge. This difference is philosophical as well as practical and will have implications for the kind of architecture required for real-time execution. In addition, the SPLICE system seems to have a stronger commitment to its prior symbolic model than do the plausible qualitative rules in EBC. Another difference is that SPLICE makes a deep assumption that a single output quantity is controlled by a single input variable. This assumption appears to be more implementational than conceptual, although it seems to help avoid an otherwise combinatorial search through alternatives.

There are several extensions planned for the work. We hope to investigate applying EBC in a more autonomous modality. As described, the approach is strongly in the learning apprentice mode (Mitchell, Mahadevan et al. 1985). It requires a solution trace (provided for the ACROBOT by HC) before a control strategy can be produced. The need for some prior method of solving the control problem is constraining and may not always be necessary. For the ACROBOT, as with many dissipative systems, the difficulty is not just to pump energy into the system, but to pump it in faster than it is dissipated (in this case by friction). It has been noted (Spong 1994) that without friction many strategies can achieve the swing-up goal. Indeed, we have found that random movements are sufficient to swing up the ACROBOT if the simulator's friction parameters are set to zero. This is quite intuitive: the ACROBOT's energy performs a kind of random walk as energy is pumped in at one time and out at another, depending on the arm configuration and the random action executed. With a reasonably sized motor driving the q2 joint, it is statistically unlikely that the ACROBOT can indefinitely avoid gaining the requisite swing-up energy. Such a strategy is impractical and probably could not work even in principle in the real world, where friction is always present. However, when presented with a trace of the random strategy achieving the goal on a simulated mechanism, the explanation engine of EBC might be able to construct a plausible justification for its successful behavior. After all, a random action sequence that achieves the swing-up goal does so for good reason, which should be understandable by a plausible explanation generation system. Then, through EBC's optimization, it may be possible to automatically construct a control strategy that is more robust than the training sample. The resulting strategy might work in the real world even though the training example could not. Is it possible for the final controller to be more robust than the source of the training examples? Yes: the experimental section demonstrates that the final strategy of EBC's ACROBOT controller is indeed more robust than the original HC trace.

Another, in some sense opposite, direction in which to extend the work is to interact more with the human expert. Currently, the EBC system constructs the qualitative explanation autonomously from the expert-supplied qualitative background domain theory. However, the expert might have knowledge or preferences about explanations. At the least, an expert's experience and intuition may supply a better plausible ordering of the explanations than the current simplicity heuristic. The expert may wish to reject some explanation entirely or to patch a failed explanation in a certain manner. Perhaps the expert believes that certain sub-justifications should or should not occur together. Perhaps the expert will wish to refine the system's qualitative domain theory in response to its experiences. The current EBC allows no such expert interaction in constructing the qualitative interpretation. The explanation construction phase is the most fragile component of the system. Clearly, there are many alternatives. The entire responsibility for explanation construction might be given to the human user. The SPLICE system (Rogers 1997), for example, requires the expert to input a complete causal influence graph. A more balanced approach, between EBC's and SPLICE's, seems promising. Such symbiotic or semi-automated approaches have proven fruitful in theorem proving (Boyer and Moore 1979) and knowledge acquisition (Davis 1980) and could be of great help in intelligent control design.

Finally, and perhaps most importantly, we are working to build a deeper understanding of the EBC approach. The current investigation has been limited to the ACROBOT system and is largely empirical. While EBC is certainly applicable to other mechanisms, we do not yet understand the limits of this applicability. What tradeoffs exist? How deep is the commitment to qualitative reasoning? Would content theories in other forms work as well? Analytic results are needed to answer such questions. A major effort in our current EBC research is to construct an analytic account of EBC.

Acknowledgments

This work was supported by the Office of Naval Research under grant N00014-94-1-0684 and by NSF under grant NSF IRI 92-09394. Comments from Renee Baillargeon greatly improved the manuscript. Discussions with Mark Spong helped to clarify EBC's relation to control theory.

List of Symbols

α maximum q2 deflection angle in the HC and EBC strategies

q1 angle of the ACROBOT shoulder joint, measured CCW from vertical
q2 angle of the ACROBOT elbow joint, measured CCW from the extension of the first link
qcm angle to the ACROBOT center of mass, measured CCW from vertical

q̇1 angular velocity of q1
q̇2 angular velocity of q2
q̇cm angular velocity of qcm
qscm absolute value of q̇cm

SDL Swing Down Left qualitative process
SUL Swing Up Left qualitative process
SDR Swing Down Right qualitative process
SUR Swing Up Right qualitative process
IC initiation condition; must be true for its qualitative process to begin
MC maintenance condition; must remain true for the duration of a qualitative process

References

Arimoto, S. (1993). Learning Control. Robot Control: Dynamics, Motion Planning, and Analysis. New York, IEEE. 185-228.

Astrom, K. and B. Wittenmark (1989). Adaptive Control. Addison Wesley.

Atkeson, C., A. Moore, et al. (to appear). “Locally Weighted Learning for Control.” Artificial Intelligence Review.

Barto, A. (1990). Connectionist Learning for Control: An Overview. Neural Networks for Control. Cambridge, MIT Press. 5-58.

Bellman, R. (1957). Dynamic Programming. Princeton, Princeton University Press.

Berenji, H. (1990). Fuzzy Logic Controllers. An Introduction to Fuzzy Logic Applications in Intelligent Systems. Kluwer Academic Publishers. 69-96.

Bertsekas, D. (1987). Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs, NJ, Prentice Hall.

Bobrow, D., Ed. (1985). Qualitative Reasoning about Physical Systems. MIT Press.

Bortoff, S. and M. Spong (1992). Pseudolinearization of the Acrobot Using Spline Functions. IEEE Conference on Decision and Control, Tucson, AZ.

Boyer, R. and J. Moore (1979). Computational Logic. New York, Academic Press.

Crawford, J., A. Farquhar, et al. (1990). QPC: A Compiler from Physical Models into Qualitative Differential Equations. Eighth National Conference on Artificial Intelligence, Boston.

Davis, R. (1980). “Meta-Rules: Reasoning about Control.” Artificial Intelligence 15(3): 179-222.

Dean, T. and G. Siegle (1990). An Approach to Reasoning about Continuous Change for Applications in Planning. Eighth National Conference on Artificial Intelligence, Boston.

DeJong, G. (1994). “Learning to Plan in Continuous Domains.” Artificial Intelligence 65(1): 71-141.

DeJong, G. (1997). Explanation-Based Learning. Encyclopedia of Computer Science. Boca Raton, CRC Press. 499-520.

Dorato, P. and R. Yedavalli, Eds. (1990). Recent Advances in Robust Control. Selected Reprint Series. New York, IEEE.

Drabble, B. (1993). “EXCALIBUR: a Program for Planning and Reasoning with Processes.” Artificial Intelligence 62(1): 1-40.

Driankov, D., H. Hellendoorn, et al. (1993). An Introduction to Fuzzy Control. Berlin, Springer-Verlag.

Farrell, J. and W. Baker (1993). Learning Control Systems. Intelligent-Autonomous Control Systems. Boston, Kluwer Academic.

Forbus, K. (1984). “Qualitative Process Theory.” Artificial Intelligence 24: 85-168.

Hogge, J. C. (1987). Compiling Plan Operators from Domains Expressed in Qualitative Process. The Sixth National Conference on Artificial Intelligence, Seattle, WA, Morgan-Kaufmann.

Kaelbling, L. P., M. L. Littman, et al. (1996). “Reinforcement Learning: A Survey.” Journal of Artificial Intelligence Research 4: 237-285.

Kolodner, J. (1993). Case-Based Reasoning. San Francisco, Morgan Kaufmann.

Kuipers, B. and K. Astrom (1994). “The Composition and Validation of Heterogeneous Control Laws.” Automatica 30(2): 233-249.

Kuipers, B. J. (1986). “Qualitative Simulation.” Artificial Intelligence 29: 289-338.

McCarthy, J. (1980). “Circumscription - A Form of Non-Monotonic Reasoning.” Artificial Intelligence 13(1): 27-39.

Michie, D. and R. A. Chambers (1968). “BOXES: An Experiment in Adaptive Control.” Machine Intelligence 2: 137-152.

Mitchell, T., S. Mahadevan, et al. (1985). LEAP: A Learning Apprentice for VLSI Design. Ninth International Joint Conference on Artificial Intelligence, Los Angeles, CA, Morgan Kaufmann.

Narendra, K. S. and A. M. Annaswamy (1989). Stable Adaptive Systems. Prentice-Hall International.

Penberthy, J. S. (1993). Planning with Continuous Change. University of Washington.

Ram, A. and J. Sanamaria (1997). “Continuous Case-Based Reasoning.” Artificial Intelligence 90(1-2): 35-77.

Rickel, J. and B. Porter (1997). “Automated Modeling of Complex Systems to Answer Prediction Questions.” Artificial Intelligence 93(1-2): 201-260.

Rogers, S. (1997). Symbolic Performance & Learning in Continuous-Valued Environments. University of Michigan.

Sacks, E. (1990). “A Dynamic Systems Perspective on Qualitative Simulation.” Artificial Intelligence 42(3): 349-362.

Sandewall, E. (1989). Combining Logic and Differential Equations for Describing Real-World Systems. First International Conference on Principles of Knowledge Representation and Reasoning, Morgan-Kaufmann.

Shavlik, J. W. (1990). “Acquiring Recursive and Iterative Concepts with Explanation-Based Learning.” Machine Learning 5(1): 39-70.

Spong, M. (1994). The Control of Underactuated Mechanical Systems. First International Conference on Mechatronics, Mexico City.

Spong, M. (1994). Swing Up Control of the Acrobot. IEEE International Conference on Robotics and Automation, San Diego.

Spong, M. (1995). Swing Up Control of the Acrobot. IEEE Control Systems Magazine. 15: 49-55.

Spong, M. (1996). Motion Control of Robot Manipulators. The Control Handbook. Boca Raton, CRC/IEEE.

Spong, M. W. and M. Vidyasagar (1989). Robot Dynamics and Control. New York, John Wiley & Sons.

Sutton, R. (1992). “The Challenge of Reinforcement Learning.” Machine Learning (Special Issue on Reinforcement Learning) 8(3/4): 225-227.

Sutton, R. (1996). Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. Eighth Advances in Neural Information Processing, MIT Press.

Sutton, R., Ed. (1992). “Special Issue on Reinforcement Learning.” Machine Learning 8(3/4): 225-227.

Watkins, C. and P. Dayan (1992). “Q-Learning.” Machine Learning (Special Issue on Reinforcement Learning) 8(3/4): 279-292.

Weld, D. and J. deKleer, Ed. (1990). Readings in Qualitative Reasoning about Physical Systems. Morgan Kaufmann.

Wellman, M. (1988). Formulation of Tradeoffs in Planning under Uncertainty. MIT, Laboratory for Computer Science.

Whitehead, S. D. (1991). A Complexity Analysis of Cooperative Mechanisms in Reinforcement Learning. National Conference on Artificial Intelligence.

Zadeh, L. (1975). “The Concept of a Linguistic Variable and its Application to Approximate Reasoning.” Information Science 8: 199-249.

Zadeh, L. (1987). Commonsense and Fuzzy Logic. The Knowledge Frontier: Essays in the Representation of Knowledge. New York, Springer-Verlag. 103-136.

Zhang, H. (1996). “Special Issue on Mathematical Induction.” Journal of Automated Reasoning 16(1-2).