
IEEE TRANSACTIONS ON SYSTEMS SCIENCE AND CYBERNETICS, VOL. SSC-2, NO. 2, DECEMBER 1966

Computer Learning in Theorem Proving D. L. JOHNSON, SENIOR MEMBER, IEEE, AND A. D. C. HOLDEN, MEMBER, IEEE

Abstract—Trigonometric theorem proofs are taken as the application for the learning model described in this paper. The simple basic structure of the heuristics and its detailed implementation are developed, with an evaluation of their possible generality in broader problem-solving processes.

INTRODUCTION

SIMULATION of human problem-solving ability using digital computers can take many forms. In one very real sense, any program resulting in a solution must be recognized as an artifice to replace human behavior which we have classically considered as intelligent. We can, and undoubtedly will, continuously haggle semantically over the terms in which we describe extensions of computer application, especially when such extensions venture into realms of activity which are not yet clearly defined or understood. Such an area is that of learning processes. Being human, we can only evaluate computer response in this area by correspondence with human results. We do not require that the computer function in exactly the same fashion as the human (in fact, this would disturb many), but we evaluate the processes in terms of human ability to learn. This paper describes a model encompassing a fragment of learning: the aspect of improving behavior in the proof of trigonometric theorems.

Contemporary usage has determined that artificial intelligence must go the way of other misused, over-used, in-group words (optimization, interface, motherhood); we must, therefore, couch descriptions of our research in other terms. Although the model described in this paper deals with the area which has often been termed artificial or synthetic intelligence, it is of principal interest not because of its problem-solving ability but because of its ability to display certain learning properties. These properties allow the model to improve its problem-solving ability within the environment of the experience to which it has been exposed: forgetting, remembering, recategorizing, and associating in such a way as to succeed, and sometimes fail, in a manner which frequently bears an uncanny human quality. It should be noted that the authors do not claim that the model encompasses the totality of human learning, nor that the solution methods obtained by the learning are necessarily those which would be used by a human operating in the same environment. It suffices that the learning processes, as they develop, bear many characteristics similar, in quality and function, to human learning.

Manuscript received March 1, 1966. The work reported in this paper was supported by the Air Force Office of Scientific Research, Aerospace Division, Information Sciences Directorate.

The authors are with the University of Washington, Seattle, Wash.

The general process, i.e., the skeletal frame, observed within the model is extremely simple. Indeed, there is reason to believe that this must be the case for any such model which can show a reasonable degree of generality. But while the process is extremely simple, the detailed implementation of the process is complex. The appetite and capacity of the digital computer demand such details. In humans and computers alike, even a limited degree of generality must often be reached by examination of many details, subjected to a hierarchy of simple decisions.

Classically, each development in machine-learning research has been oriented to its specific subset of human problems. The requirement for this is obvious. Often the subset is so minuscule as to appear trivial to those without experience in the field. The danger appears not in the triviality or naivety of the learning problem, but in the claims of generality which frequently follow the development of a model. Humans must learn by a vast number of different techniques, the selection being determined by a higher-order learning process which evaluates the type and size of problem, the experience, current psychological state, and capacity of the learner, in order to fix the process or combination of learning processes used in even the simplest of problems.

The implication of basic governing processes to orient and organize the learning function suggests various levels of learning models. Such levels should vary from extensive generality to specific applications and codes. One means of identifying the processes present within models of varying order is to examine a range of problem applications, considering those aspects of each which are common and evaluating the requirement for uniqueness as it may appear.

Before we can approach the complete learning system, for either computers or humans, it is necessary to examine some of the various models contributing to the totality. It is clear that the learning process for chess play is different from that for learning a natural language; these, in turn, are separate from learning the application, development, and combination of a limited number of transformations in the proof of a theorem. Although research carried on within our group considers all three of these applications, this paper describes the work in theorem proving. The most extensively investigated theorems considered are taken from trigonometry.

GENERAL STRUCTURE

The various steps or processes of learning incorporated in the model can be generally described as follows.

1) The problem is stated in a standardized form. Initially this step was considered as a compromise required more for the needs of the computer than human needs. Experience has indicated, however, that such a statement in a standard or canonical form has much to do with the recognizability of familiar characteristics and is often a part of human problem solving. From mathematics to general observation and concept formation, such a device acts to reduce the problem scope. Recognizing the need for such standardization, however, leads to a procedure for the production of a determined representative form to be used in categorization and association. At this step, we must move from the general to the specific and, perhaps, to the arbitrary. It should be noted, however, that within the scope of trigonometric theorems, many elementary theorems as taken from texts were completely solved by initial standardization.

2) Characteristic sequences or strings are developed to categorize the theorem to be proven. The choice of the characteristic representation fixes much of the specific implementation of heuristics which operate upon the characteristics within the solution, often directing and limiting the function of the model. Although the heuristics used in the process of characteristic development are ingratiatingly naive, the results are such as to yield faith in humility.

3) Selected basic transformations are applied to each side of the problem-identity, or the theorem to be proved, based upon independent reduction of the complexity of each side of the theorem, i.e., simplification.

4) Selected basic transformations are applied to each side of the problem-identity, trying to make the two sides more alike. The heuristics of transformation selection as used in both 3) and 4) must allow consideration of a growing number of valid transformations and must provide increasing discrimination within the experience of the model.

5) As each theorem is proven, the various steps in the solution and the final theorem must be considered as valid transformations for future use. Consistent and indiscriminate addition of these transformations to the list of possible information available for solution would ultimately result in a prohibitive number of transformations to be evaluated and considered. To prevent such an occurrence, a forgetting heuristic has been incorporated into the operation of the model by which infrequently used transformations are forgotten, while the most frequently used ones are given priority of consideration. An additional facility is provided to the remembered list. A transformation, if forgotten from the ready-access list and then developed again within the learning experience of the model, is established with higher priority than if it had been developed for the first time. The forgetting ability of the model provides that transformations useful only in the development of a specific theorem may be removed from memory. The basic transformations necessary for redevelopment, however, are usually maintained.

The model, as with all but the youngest students, starts its learning existence with certain a priori knowledge. It recognizes the symbols and operations of the application of trigonometry; it is able to perform algebraic manipulations upon equations; it has, and will keep, within its memory five basic theorems to use as fundamental transformations. (These theorems are shown as identities ID 1 to ID 5 in Table I.) Other identities may be learned and forgotten, but the model keeps the basic five during its entire existence.

After experimentation with the model in the application for which it was developed, the fundamental structure and heuristics were examined in the light of other applications with roughly the same requirements as the proof of trigonometric identities. The proof of logical statements was used in this context to aid in evaluating the model's generality.

TABLE I
IDENTITIES PROVED AND GENERATED BY LEARNING MODEL

Final        Transformations       New Transformations
Identities   Used in Proof         Developed
ID 6*        1; 2                  6
ID 7*        1; 2; 5               7; 8; 9
ID 10*       9; 4                  10; 11
ID 12*       1; 11                 12; 13
ID 14*       4; 1; 10              14; 15; 16; 17
ID 18*       2; 4                  18; 19
ID 20*       14; 18; 6; 18; 6      20; 21; 22; 23; 24; 25; 26; 27
ID 32*       10; 4                 32; 33
ID 34*       2; 8                  34; 35
ID 36*       3; 4                  36
ID 37*       3; 27                 37; 38
ID 39*       10; 10; 14            39; 40

Identity                                                         F-Score
ID 1    tan x = sin x/cos x                                        17
ID 2    cot x = cos x/sin x                                        17
ID 3    cosec x = 1/sin x                                           5
ID 4    sec x = 1/cos x                                            35
ID 5    sin²x + cos²x = 1                                           5
ID 6*   tan x cot x = 1                                             6
ID 7*   sin x cos x(tan x + cot x) = 1                             -5
ID 8    sin x(sin x + cos x cot x) = 1                              1
ID 9    cos x(cos x + sin x tan x) = 1                              1
ID 10*  sin x tan x + cos x = sec x                                 8
ID 11   sin x + cos x cot x = cosec x                               2
ID 12*  cos x(tan x + cot x) = cosec x                             -3
ID 13   sin x(tan x + cot x) = sec x                               -3
ID 14*  tan²x + 1 = sec²x                                           4
ID 15   cot²x + 1 = cosec²x                                        -2
ID 16   cos x(1 + tan²x) = sec x                                   -2
ID 17   sin x(1 + cot²x) = cosec x                                 -2
ID 18*  sin x cot x sec x = 1                                      11
ID 19   cos x tan x cosec x = 1                                   -10
ID 20*  sin²x(1 + tan²x) = tan²x                                    2
ID 32*  sin x tan x = (sec x + 1)(1 - cos x)                        2
ID 33   cos x cot x = (cosec x + 1)(1 - sin x)                      3
ID 34*  sin²x(1 + cot²x) = 1                                        3
ID 35   cos²x(1 + tan²x) = 1                                        4
ID 36*  (sin x + cos x)/(cosec x + sec x) = sin x cos x             5
ID 37*  cosec x(sin x + cos x) = cot x + 1                          5
ID 38   sec x(sin x + cos x) = tan x + 1                            0
ID 39*  sin x tan x + sin x tan³x + cos x + cos x tan²x = sec³x     0
ID 40   sin x tan³x + cos x tan²x + sec x = sec³x


STANDARD OR REPRESENTATIVE FORM

Standardization of the format for consideration of the theorem by the simplifying heuristics can be attained by several approaches. One or more of the following three can be applied, the validity of application of each being determined by the structure of the individual application. This reduction of problem scope and formalization into a canonical form is certainly a process valuable to humans and computers alike in the solution of many problems.

When ordering of symbols (variables, constants, connectives, predicates, or quantifiers) is not critical, as is the case in commutative systems, arbitrary ordering to a prescribed base may be followed. This ordering may be according to a predetermined priority system: alphabetical, order of appearance, frequency of occurrence, or by some other means indicated by the application.

When no member of a group of variables appears in any other combination within a given expression, the entire group can be replaced by a single variable. Care must be taken that such replacement does not interfere with the use of learned information; in logic, such a move may be extremely valuable, whereas in trigonometry it may be self-defeating.

Within the arbitrary framework of standardization, less commonly used connectives can be replaced by those more commonly used, i.e., manipulation within the equation to replace subtraction and division by addition and multiplication, respectively; in logic, implication can be replaced by other logical operations.

Specifically, the trigonometric theorem-proving model reduces a problem-identity to a standard form described by the following procedures.

1) Multiplication throughout by any functions appearing as denominators.
2) Removal of all brackets by multiplication of terms.
3) Removal of all negative signs by transfer to the other side of the equation.
4) Removal of numerical coefficients and exponents by use of multiple occurrences.
5) Cancellation of all cancellable terms.
6) Ordering of terms throughout the identity.
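Step 4) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code; the representation of a side as (coefficient, {object: exponent}) pairs is an assumption made for the sketch.

```python
def remove_coefficients_and_exponents(side):
    """Standardization step 4: an exponent n on an object becomes n
    occurrences of the object within the term, and a numerical
    coefficient k becomes k repeated copies of the whole term,
    e.g. 2 sin^2 x -> sin x sin x + sin x sin x."""
    out = []
    for coeff, factors in side:
        term = []
        for obj, exp in factors.items():
            term.extend([obj] * exp)        # expand exponents
        out.extend([list(term) for _ in range(coeff)])  # expand coefficient
    return out
```

For example, `remove_coefficients_and_exponents([(2, {"sin x": 2})])` yields `[['sin x', 'sin x'], ['sin x', 'sin x']]`, i.e., sin x sin x + sin x sin x.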

Ordering within terms is accomplished by the priority: sin x, cos x, tan x, cot x, cosec x, sec x, 1. To rearrange an identity into its ordered form, the terms within the identity are individually ordered. All terms on each side of the identity are then ordered with respect to those on the same side upon the same priority. When the sides are ordered, the first terms on each side are compared, and the complete sides are interchanged if necessary to give left-side precedence. As an example,

tan x sin x / (1 - cos x) = 1 + sec x

tan x sin x = (1 - cos x)(1 + sec x)
            = 1 - cos x + sec x - cos x sec x

sin x tan x + cos x + cos x sec x = sec x + 1.

Hence, the standard form provides a unique representation for any problem identity, using only the rules of algebra and an arbitrary priority of variables, constants, and connectives. Although such ordering and selection of allowable connectives will affect the solutions obtained and the experience of the model, there is every indication that human problem solvers tend to regard objects which can easily be transformed into one another as the same or equal. Certainly, for the computer, we can state a principle for problem-solving models: "Quantities which can be transformed into each other should as nearly as possible have the same representation."
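The ordering rules above can be sketched in Python. The representation of a term as a list of object names and the function names are assumptions for illustration, not the authors' implementation.

```python
# Priority used for ordering objects (lower index = higher precedence).
PRIORITY = ["sin x", "cos x", "tan x", "cot x", "cosec x", "sec x", "1"]
RANK = {obj: i for i, obj in enumerate(PRIORITY)}

def order_term(term):
    """Order the objects within one product term by priority."""
    return sorted(term, key=RANK.get)

def order_side(side):
    """Order each term, then order the terms of the side by comparing
    their object priorities position by position."""
    terms = [order_term(t) for t in side]
    return sorted(terms, key=lambda t: [RANK[o] for o in t])

def standard_order(side_a, side_b):
    """Order both sides, then interchange them if necessary so that
    the side whose first term has precedence ends up on the left."""
    a, b = order_side(side_a), order_side(side_b)
    if [RANK[o] for o in b[0]] < [RANK[o] for o in a[0]]:
        a, b = b, a
    return a, b
```

Applied to the example above, ordering sec x + 1 against tan x sin x + cos x + sec x cos x places the side beginning with sin x tan x on the left, as in the final line of the derivation.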

THE CHARACTERISTIC SET

For each theorem considered, a characteristic set is developed to aid in the categorization of the theorem and the association required for the determination of selection and ordering of applicable transformations. The following quantities will be defined as characteristics.

1) Any single object from the set of objects making up the theorem.

2) Any combination of two objects from the set which occur together in the same term of an identity. (A term is defined to be the product of objects occurring in a trigonometric function, i.e., sin x cos x tan x + sin x cot x has two terms made up of sin x cos x tan x and sin x cot x.)

3) Any combination of two terms occurring in an identity. If the two terms in a characteristic have any objects in common, these objects are removed in the formation of the characteristic. (Hence, by rule 3, the only characteristic defined in the parenthetic expression of 2 is cos x tan x + cot x.)

Each identity will have two associated sets of characteristics, one for the right side and one for the left. We will represent the left and right side characteristics by C and D, respectively. As an example of the development of a characteristic set, let us examine the identity

sin²x(1 + tan²x) = tan²x.

In standard or representative form, this is expressed as

sin x sin x + sin x sin x tan x tan x = tan x tan x

and

C = (sin x, tan x, sin x sin x, tan x tan x, sin x tan x, tan x tan x + 1)

D = (tan x, tan x tan x).
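Rules 1) to 3) can be sketched as follows. The string encoding of characteristics and the helper names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

RANK = {"sin x": 0, "cos x": 1, "tan x": 2, "cot x": 3,
        "cosec x": 4, "sec x": 5, "1": 6}

def characteristic_set(side):
    """side: list of terms, each term a list of objects; the left side
    of sin x sin x + sin x sin x tan x tan x = tan x tan x is
    [['sin x', 'sin x'], ['sin x', 'sin x', 'tan x', 'tan x']]."""
    chars = set()
    for term in side:
        chars.update(term)                      # rule 1: single objects
        ordered = sorted(term, key=RANK.get)
        for pair in combinations(ordered, 2):   # rule 2: object pairs in a term
            chars.add(" ".join(pair))
    for t1, t2 in combinations(side, 2):        # rule 3: pairs of terms
        c1, c2 = Counter(t1), Counter(t2)
        common = c1 & c2                        # objects shared by both terms
        r1 = sorted((c1 - common).elements(), key=RANK.get) or ["1"]
        r2 = sorted((c2 - common).elements(), key=RANK.get) or ["1"]
        if [RANK[o] for o in r2] < [RANK[o] for o in r1]:
            r1, r2 = r2, r1                     # order the two halves by priority
        chars.add(" ".join(r1) + " + " + " ".join(r2))
    return chars
```

On the left side of the example identity this reproduces the set C given above, including the two-term characteristic tan x tan x + 1 obtained by cancelling the common factor sin x sin x.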

The basic transformation identities, and all other valid transformations learned by the model, have their associated characteristic sets. The characteristics are ordered by the same priority followed in the standard form.

TRANSFORMATION SELECTION

The major problem of any synthetically intelligent learning model is selecting, from all of the possible operations, that which is most likely to bring about the solution to the problem. If a complete search is made, the result is an exponentially increasing number of trials. In the limited problem of trigonometric identity proofs, if the number of identities is kept small, such a search could be executed. It is considered, however, that such repetitive trial-and-error methods are less interesting and informative than the methods developed here.

Selection for Simplification

A student trying to learn to solve trigonometric identities tends to use certain strategies. There is a tendency to simplify both sides of the identity, where this is possible, without paying too much attention to the ultimate goal of transforming the two sides into identical forms. In modeling such behavior for simulation by computation, each available transformation is given a simplification score which indicates the number of objects which will be removed when the identity is applied to a problem. The simple trigonometric forms and the addition symbol are defined to be objects in this case. The simplification scores of the basic transforms are shown.

Identity (ID)                    Simplification Score
ID 1   sin x = cos x tan x       1
ID 2   sin x cot x = cos x       1
ID 3   sin x cosec x = 1         2
ID 4   cos x sec x = 1           2
ID 5   sin²x + cos²x = 1         5

The only circumstance under which a simplifying transformation will be selected and used is when the complete side of the transformation containing the greater number of objects is present within the problem identity.

The following procedure is used in determining whether or not a simplifying identity can be used. The list of characteristics of the problem identity is scanned until a characteristic is found indicating that the complete side of one of the available transformations is present within the problem identity, as specified. The scanning continues, considering all applicable transformations; if several possible simplifying transformations are found, the one with the greatest score is selected. The characteristic which caused this selection is also noted so it can be used in the transformation process to determine which term in the problem identity should be transformed. If more than one transformation is found with equal scores, the first one located is generally chosen.
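The scan can be sketched as follows. The (name, large-side characteristic, score) tuples are an assumed encoding: each transformation is represented by the characteristic standing for its more complex side, written in the standard form given earlier.

```python
# Basic transformations, each keyed by its larger side in standard form.
BASIC_TRANSFORMS = [
    ("ID 1", "cos x tan x", 1),
    ("ID 2", "sin x cot x", 1),
    ("ID 3", "sin x cosec x", 2),
    ("ID 4", "cos x sec x", 2),
    ("ID 5", "sin x sin x + cos x cos x", 5),
]

def select_simplifier(problem_chars, transforms=BASIC_TRANSFORMS):
    """Scan the ordered characteristic list of the problem identity;
    each characteristic matching the complete large side of a
    transformation is a candidate. The candidate with the greatest
    simplification score wins; ties go to the match found first."""
    by_side = {side: (name, score) for name, side, score in transforms}
    best = None
    for ch in problem_chars:
        if ch in by_side:
            name, score = by_side[ch]
            if best is None or score > best[2]:
                best = (name, ch, score)
    return best  # (transformation, triggering characteristic, score)
```

Run on the characteristic list C of the worked example that follows, the scan hits sin x cot x before cos x tan x, and with no higher-scoring candidate it returns ID 2, matching the selection described in the text.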

To illustrate the operation of the model, using only the five basic transforms and the selection by simplification heuristic, consider the problem theorem

sin x cos x(tan x + cot x) = 1.

This is immediately rearranged into its standard form

sin x cos x tan x + sin x cos x cot x = 1

with the left-side characteristic list

C = (sin x, cos x, tan x, cot x, sin x cos x, sin x tan x, sin x cot x, cos x tan x, cos x cot x, tan x + cot x).

The selection process then scans this characteristic list

to find which elements are the same as the whole characteristic of a simplifying transformation. The first simplifying characteristic found is sin x cot x, and since no other is found with a higher score, the corresponding transformation ID 2 is selected. Since the characteristic sin x cot x was responsible for this selection, this information is used to determine which term the transformation is to be made on.

The transformation is applied, and the new theorem is placed in standard form

sin x cos x tan x + cos x cos x = 1.

The new characteristic list contains sin x, cos x, tan x, cos x cos x, sin x cos x, sin x tan x, cos x tan x, sin x tan x + cos x, the final term being formed by cancellation of the common factor cos x.

The heuristic now selects ID 1, and the result is put in standard form as

sin x sin x + cos x cos x = 1.

The theorem now has the same form as one of the available transformations and is, therefore, proven. The problem theorem and all theorems developed during the proof are now added to the list of transformations for use in later problems. The familiarity scores of the theorems used are increased by six; all other theorem scores are decreased by one. The available transformations are recorded according to familiarity-score rank.
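The bookkeeping just described amounts to a small sketch: scores of the transformations used rise by six, all others fall by one, and the list is re-ranked. The function and variable names are assumptions.

```python
def update_familiarity(scores, used, bonus=6, penalty=1):
    """scores: {transformation name: familiarity score};
    used: names of the transformations applied in the successful proof.
    Returns the transformation names ranked by updated score."""
    for name in scores:
        scores[name] += bonus if name in used else -penalty
    return sorted(scores, key=scores.get, reverse=True)
```

For instance, with ID 1, ID 2, and ID 5 all at score 3, a proof using ID 1 and ID 2 leaves them at 9 and drops ID 5 to 2, so ID 5 falls to the bottom of the ranked list.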

Selection to Equate Sides

The second strategy considered is that of selecting transformations which tend to make the left-side function of the identity more like that on the right. The meaning of the word like is not entirely clear here. The student has a feeling about it. Unfortunately, when the computer is used to simulate such activity, vague feelings must be transformed into precise operations. Therefore, after the simplification strategy has been exhausted, if no proof has been found, the reduction-to-identity strategy is tried. This strategy depends upon a parameter called the applicability score. The most frequently used transformation having the highest applicability score is selected for application to the problem identity.

A precise description of the development of the applicability score is too extensive to include within this paper.1

The score is obtained for each transformation identity applied to the problem identity by a logical combination of the sets of characteristics. Assuming that A_k and B_k are, respectively, the right and left side characteristic sets of the transformation identity considered and C and D are those of the problem identity, we let

E_k = A_k ∪ B_k ∪ C ∪ D.

By operating upon and mapping the results of this combination, a score is obtained to determine if application of the transformation will force the two sides of the problem identity into equivalence, i.e., a form containing more common characteristics. That transformation selected as most effective in making the two sides of the problem identity most nearly the same, or in cases of equivalent scores, the first obtained, is then subjected to further scrutiny by a similar but more detailed process to determine which side of the transformation identity should be applied at which point within the problem identity.

1 D. L. Johnson and staff, "Machine learning for general problem solving," Rept. No. AF-AFOSR-486-46A, Air Force Office of Scientific Research, Aerospace Division, United States Air Force, Washington, D.C., 1964.

Weighting within the development of the applicability score acts to provide higher priority to the transformation of more complex terms within the problem identity, tending to select transformations removing undesired complex terms or producing desired complex terms, and to reject transformations removing desired complex terms or producing undesired complex terms. The greater the complexity of the characteristic, the greater the influence will be upon the score.

Further insight into the development of the applicability score and its use may be found in an example in the development of a problem identity proof. A simple identity not included within the basic five will be considered, i.e.,

tan x cot x = 1.

As this is already in standard form, no rearranging is required. The characteristic set of the problem identity's left side is

C = (tan x, cot x, tan x cot x).

It is determined that none of the basic five transformation theorems in memory tend to simplify the problem identity. The applicability scores are then computed for each of the basic identities except those which have no characteristics in common with the problem identity. The only transformations having at least one characteristic in common with the problem identity are ID 1 and ID 2.

Initially, there are two scores determined for each transformation. One score indicates the desirability of applying its transformation to the left side of the problem identity, the other of application to the right side. Applicability scores for the kth transformation are denoted by s_1k and s_2k, where 1 and 2 refer to the left and right sides of the problem identity, respectively. Since only two transformations need to be scored, there will only be four scores computed, i.e.,

s_11 = -12    s_21 = -8
s_12 = -8     s_22 = -12.

The characteristic lists of ID 1 and ID 2 are

C_1 = (sin x)                       D_1 = (cos x, tan x, cos x tan x)
C_2 = (sin x, cot x, sin x cot x)   D_2 = (cos x).

To calculate each s_rk, the lists of characteristics of the problem identity and ID 1 and ID 2 are compared. If any characteristic is present on one side of the problem identity and not on the other side, it would be desirable to use a transformation which also has this characteristic on one side only. Since it is unlikely that a single transformation will cause the disappearance of all undesirable characteristics or the appearance of only desirable ones, positive scores are given for desirable situations and negative scores are given for undesirable ones. Any transformation with more desirable than undesirable features will, thus, tend to be selected on the basis of score.

The scoring system also gives greater weight to situations in which multiple characteristics rather than single characteristics are involved, the weighting being based on 2^n, where n is the number of objects which are present in the characteristic.
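The per-characteristic score change can be sketched as a single rule. This is a simplified reading of the walkthrough that follows, not the full procedure of the cited technical report; the function name and argument layout are assumptions.

```python
def contribution(n_objects, desirable, neutral=False):
    """Score change for one characteristic: +2^n when it helps make
    the two sides alike, -2^n when it works against that, and -1 in
    the neutral case where the problem identity contains the
    characteristic but the transformation does not."""
    if neutral:
        return -1
    return 2 ** n_objects if desirable else -(2 ** n_objects)
```

Thus an undesirable single-object characteristic such as sin x contributes -2, a two-object one such as cos x tan x contributes -4, and the neutral tan x case discussed below contributes -1.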

Consider each of the characteristics, one at a time.

sin x: This characteristic is present only in the left side of ID 1 and the left side of ID 2. Since it is not present in the problem identity, this is considered undesirable and each score is decreased by 2^n with n = 1; therefore,

s_11 = s_12 = s_21 = s_22 = -2.

cos x: This is present only on the right of ID 1 and the left of ID 2 and not present in the problem identity; thus, scores are again decreased by 2, giving

s_11 = s_12 = s_21 = s_22 = -4.

tan x: This characteristic is present on the right of ID 1 and the left of the problem identity. ID 2 tends to be unsuitable because the problem identity contains the characteristic while the transformation theorem does not. However, in such cases, the application of the transformation will not be helpful or negative insofar as this characteristic is concerned, and rather than the 2^n weighting factor, a score of -1 is used. This yields scores of

s_11 = -6    s_21 = -2
s_12 = -5    s_22 = -5.

cot x: This is present only on the left of ID 2 and the left of the problem identity, making

s_11 = -7    s_21 = -3
s_12 = -3    s_22 = -7.

Continuing with consideration of the more complex characteristics with their increased weight, cos x tan x scores -4 against ID 1, sin x cot x scores -4 against ID 2, and tan x cot x appears only in the problem identity and scores -1 against both theorems. This leaves a final score of

s_11 = -12    s_21 = -8
s_12 = -8     s_22 = -12.

We see that the scores s_12 and s_21 have the same maximum value, so the first one of these computed would select the transformation, i.e., ID 1.

The modified score is then computed to determine whether the left or right side of the transformation identity should be used and to provide further discrimination as to whether the right or left side of the problem identity should be transformed. This process follows a procedure very similar to that just observed in the s_rk scoring. It is determined that the left side of the problem identity should be used, and the right side of ID 1 applied.

Application of ID 1 to the problem identity leaves the resultant equation in standard form. The equation has the same form as one of the available transformation identities, ID 2, so the problem identity is proven.

It should be noted that in more complex proofs, both simplification and equalization heuristics are required. Simplification is carried as far as possible; if this does not result in proof, equalization follows; then, simplification is again used.

In early stages of the model's development, a different subprogram was used to carry out each selected transformation. This tended to restrict the number of ways in which a given identity could be applied to effect a transformation. The general transformation routine, as currently used, determines which term of the transformation should best be applied to which term of the problem. This is done by evaluation of the term from both the problem and the transformation identities containing the characteristic having the best score. Such an application is effective in both the simplification and applicability selection.

If the association and selection processes of learning were to encompass every differentiable detail, both human and machine learning would require a prohibitive number of observations and decisions. Evidence seems to indicate that initial screening takes place by observation of gross categories, with increasing differentiation occurring only as detailed selection is approached. As the program described here solves an increasing number of problems, a large number of available transformation identities are added to its memory. It becomes desirable to design a process which will determine those transformation identities most likely to be useful and to initially scrutinize in detail only those for application to the specific problem being considered. If this limited search does not result in a reasonably effective identity, a wider search can be instituted.

The method used in our model for the selection of only those identities most likely to be useful involves the removal of all simple (single- or double-object) characteristics from the characteristic sets of the available transformation identities. With this, each set of characteristics is examined to determine if the remaining characteristics are present in the characteristic set of the problem identity. Scores are initially computed only for transformation identities having major characteristics in common with those of the problem. By leaving many of the elemental or simple characteristic sets open during the initial processing, there is a considerable reduction in the number of transformations scored.

In the event that the initial search does not lead to a suitable transformation, the search is gradually expanded, with double-object sets added to the characteristics evaluated; then, if necessary, the complete characteristic set is used.
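The widening search can be sketched as follows. The representation of each transformation's characteristic set as (characteristic, object count) pairs is an assumption for illustration.

```python
def restricted_candidates(problem_chars, transforms):
    """problem_chars: set of characteristic strings of the problem.
    transforms: {name: [(characteristic, n_objects), ...]}.
    First admit only transformations sharing a major (3+ object)
    characteristic with the problem; if none qualify, widen to
    double-object characteristics, then to the complete sets."""
    for min_objects in (3, 2, 1):
        hits = [name for name, chars in transforms.items()
                if any(c in problem_chars and n >= min_objects
                       for c, n in chars)]
        if hits:
            return hits
    return []
```

Only the transformations returned by the narrowest successful pass are then scored in detail, which is the reduction in scored transformations described above.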

THE FAMILIARITY HEURISTIC

Within the learning model there are several attempts to consider the significance of familiarity in the learning process. The use of the standard form tends to structure the statement of the problem and transformation identities into a familiar format; new transformations as they are developed occur in the same structure.

Ordering and retention of transformation identities in memory is considered by means of a different kind of familiarity score indicating the frequency of use or utility within the environment of the model's experience. Every time a transformation is successfully used, the associated familiarity score is increased by a fixed arbitrary amount. Periodically, all familiarity scores are decreased by a small value, with the list of transformations ordered in memory in terms of their familiarity scores. Thus, when restricted searches are made, only recently successful transformations are examined. When the memory is limited, unused transformations are removed from the active memory and maintained only for comparison with new transformations as they are developed. In the event that a transformation is dropped from active memory because of little use, as indicated by its familiarity score, and is then redeveloped during a later phase of the learning experience, the new identity is reentered into memory with a higher familiarity score than if it had been the initial development. This is an effort to recognize a very human characteristic, one which may well be of value within the learning process.
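The whole heuristic fits in a small sketch: scores rise on use, decay periodically, over-capacity memory sheds the least familiar transformations, and a forgotten transformation that is redeveloped re-enters with a boost. The class layout and all numeric constants are illustrative assumptions, not values from the paper.

```python
class TransformMemory:
    """Sketch of the familiarity and forgetting heuristic."""

    def __init__(self, capacity=4, use_bonus=6, decay=1,
                 initial=5, relearn_bonus=3):
        self.capacity = capacity
        self.use_bonus = use_bonus
        self.decay = decay
        self.initial = initial
        self.relearn_bonus = relearn_bonus
        self.active = {}        # name -> familiarity score
        self.forgotten = set()  # kept only to recognize redevelopment

    def learn(self, name):
        """A newly developed transformation enters active memory; one
        that was forgotten and is redeveloped gets a higher score."""
        bonus = self.relearn_bonus if name in self.forgotten else 0
        self.forgotten.discard(name)
        self.active[name] = self.initial + bonus

    def use(self, name):
        """Successful use raises the familiarity score."""
        self.active[name] += self.use_bonus

    def tick(self):
        """Periodic decay; when over capacity, the least familiar
        transformations are dropped from the active list."""
        for name in self.active:
            self.active[name] -= self.decay
        while len(self.active) > self.capacity:
            worst = min(self.active, key=self.active.get)
            self.forgotten.add(worst)
            del self.active[worst]

    def ranked(self):
        """Active transformations in familiarity-score order."""
        return sorted(self.active, key=self.active.get, reverse=True)
```

With a capacity of two, learning three identities, using one, and decaying once drops an unused identity into the forgotten set; relearning it re-enters it above its original score.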

EXAMPLES OF TRIGONOMETRIC PROOFS

By the means developed so far, the model can perform with apparent intelligence and learning. To evaluate these characteristics, it is important to observe the model in the solution of increasingly difficult problems, to find indications of its ability to obtain more efficient solutions through experience.

A partial list of the trigonometric identities used and developed by the model is given in Table I, the first five of which are the basic identities initially provided to the model. The ones marked with asterisks were given as problems; the remaining identities on the list were generated during the learning process. The problems were taken, in sequence, from textbooks. Table I also shows the identities required in the proofs of the problem identities. It can be observed that the proofs made frequent use of the identities added during the learning experience. It should also be noted that when an identity was proven, its dual was generated and accepted for future use; e.g., ID 11 was developed as the dual of ID 10.

An illustration of the learning properties of the model can be observed in the solution of problem identity 39, as shown in Table I. The experienced model proved the theorem in three steps, by application of transformation identities 10, 10, and 14. Starting with no experience and only the five original transformation identities, the model required ten steps to prove the theorem, using ID 1 five
times, ID 4 three times, and ID 5 twice before finally obtaining the transformation into ID 5.

On the other hand, there were very few theorems which the model was unable to prove, even though it had what was apparently sufficient background for solution. Analysis indicates that this disability was based on the single-step evaluation made by both heuristics for transform selection. The existent selection is made on the basis of improvement under a single transformation. In the theorems the model could not prove, simplification or equivalence was attained only by the application of two or more transformations. The fact that this limitation confronted the model is less surprising than the fact that it was noted so infrequently. The processes under which sequences of transformations can be evaluated and applied require only modest changes in the details of the heuristic structures. Current research is considering this extension of the model.
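The proposed extension, evaluating short sequences of transformations rather than single steps, might be sketched as follows. This is purely our illustration of the idea (the authors report it only as current research); the toy string rules and all names are assumptions.

```python
from itertools import product

# Hypothetical sketch of multi-step lookahead: score short *sequences*
# of transformations, so that a step which temporarily complicates the
# expression can still be chosen when the sequence as a whole simplifies.
def best_sequence(expr, transforms, score, depth=2):
    """transforms: name -> function returning the new expr, or None."""
    best, best_score = None, score(expr)
    for seq in product(transforms, repeat=depth):
        e, ok = expr, True
        for name in seq:
            e = transforms[name](e)
            if e is None:
                ok = False
                break
        if ok and score(e) < best_score:      # lower score = simpler
            best, best_score = seq, score(e)
    return best

# Toy rules on strings: "grow" lengthens the expression, yet the pair
# (grow, shrink) yields a net simplification a one-step heuristic misses.
toy = {
    "grow":   lambda e: e.replace("aa", "aba") if "aa" in e else None,
    "shrink": lambda e: e.replace("aba", "b") if "aba" in e else None,
}
print(best_sequence("aa", toy, len))   # ('grow', 'shrink')
```

A single-step heuristic scoring `len` would reject "grow" outright; the pair search finds the shorter result it enables.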

PROGRAM ORGANIZATION

The computer program of the learning model was written in SNOBOL3² and implemented on the IBM 7094. The order of operation is as follows:

1) The problem identity is placed in standard form.
2) The characteristic set of the problem identity is developed.
3) Simplification scoring, transformation selection, and application continue until there are no more additional simplifying transformations or until the theorem is proved.
4) Equivalence scoring, transformation selection, and application continue. If this does not prove equivalence, simplification is again attempted.
5) Standardization, development of modified characteristic sets, and cancellation take place after every transformation is applied.
6) If the problem identity is proved, its original statement and all transformed equivalents are added to the memory as transformation theorems.
7) Familiarity scoring is maintained to order and limit memory.
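The seven steps above can be exercised with a toy, runnable stand-in. The naive string rewriting and all names below are our simplification, not the model's SNOBOL routines.

```python
# A toy, runnable stand-in for the control flow above; a rule is a
# (pattern, replacement) pair, and the theorem is proved when both
# sides agree. This is an assumption-laden illustration only.
def standardize(e):
    return e.replace(" ", "")                     # step 1 (trivial stand-in)

def prove(lhs, rhs, rules, memory, limit=20):
    expr, goal = standardize(lhs), standardize(rhs)
    for _ in range(limit):
        if expr == goal:
            memory.append((lhs, rhs))             # step 6: remember the theorem
            return True
        for pat, repl in rules:                   # steps 3-4: pick a rule
            if pat in expr:
                expr = expr.replace(pat, repl, 1) # apply the transformation
                break                             # step 5 would re-standardize
        else:
            return False                          # no transformation applies
    return False

rules = [("sin^2+cos^2", "1"), ("1*", "")]
memory = []
print(prove("1*sin^2+cos^2", "1", rules, memory))   # True
```

Step 7, the familiarity bookkeeping, would run over `memory` between problems to reorder and limit the stored theorems.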

APPLICATION TO LOGICAL THEOREMS

Any evaluation as to the generality of the processes used within the learning model developed for trigonometry can best be made by means of a different application. As the model is clearly structured for theorem proof, the problem of proving identities from the propositional calculus was chosen.

The two types of problems have striking differences in significant aspects. In the propositional calculus, any variable can be replaced by any other variable or compound expression if such replacement is performed consistently throughout the problem; i.e., the expression BCvABC can be simplified to AvBC by the transformation

2 D. J. Farber, R. E. Griswold, and I. P. Polonsky, "SNOBOL, a string manipulating language," J. ACM, vol. 11, no. 1, January 1964.

XvXY = XvY. The multiplicity of possible replacements can create considerable difficulty in the realization of generality in the transformation process; however, the use of a fixed standard or representative form can again ease the difficulty.

Regardless of the initial form of the logical theorems, they were placed in alphabetic order, with variables positioned within the alphabetically ordered system on the basis of their frequency of occurrence in the problem or in the transformation theorem. Replacement of any group of variables was made by a single variable when no member of the group appeared in any other combination. The implication connective was replaced by more familiar ones. Examples of the foregoing statements are:

ZXYvZȲ becomes XYZvȲZ

ZYXvY becomes XvXY

YZv1 becomes Xv1

(X → Y) becomes (X̄vY).
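One ingredient of this standard form, renaming variables by frequency of occurrence, can be sketched as follows. This is an assumption-laden illustration rather than the model's procedure: the complement bar, the within-term alphabetic ordering, and group replacement are omitted.

```python
from collections import Counter

# Hedged sketch of one part of the logical standard form: variables are
# renamed so that the most frequent variable becomes X, the next Y, and
# so on (ties broken by first appearance). Lowercase "v" is the OR
# connective and is left untouched.
def standardize_vars(expr):
    letters = [c for c in expr if c.isupper()]
    freq = Counter(letters)
    order = sorted(freq, key=lambda v: (-freq[v], letters.index(v)))
    names = "XYZWVU"
    mapping = {old: names[i] for i, old in enumerate(order)}
    return "".join(mapping.get(c, c) for c in expr)

print(standardize_vars("ZYBvB"))   # B is most frequent, so B -> X
```

Renaming by frequency means that structurally identical theorems over different variable names collapse to the same representative, which is what makes a fixed table of transformation patterns applicable.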

Even using the standard form, it is necessary to allow replacement of variables when applying transformations in logical theorems.

The characteristic set was developed from the standard or representative form containing: every variable or con­stant, every pair of variables or constants separated by an and connective, and every pair of groups of variables or constants separated by and. Alphabetic ordering was used to fix order of appearance. Therefore, for

ABvĀCvĀ = ĀvB

the characteristic sets for the left and right sides are

C = (A,Ā,B,C,ĀvAB,ĀvĀC,ABvĀC)

D = (A,B,AvB).
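This construction can be sketched in code; our reading of it is an assumption, and the apostrophe stands in for the overbar (A' for Ā) since plain strings carry no diacritics.

```python
from itertools import combinations

# Illustrative sketch (our reading of the construction, with A' standing
# in for the overbar): the characteristic set of a standard-form sum of
# products contains every literal and every pair of product terms.
def literals(term):
    out, i = [], 0
    while i < len(term):
        lit = term[i]
        if i + 1 < len(term) and term[i + 1] == "'":
            lit += "'"          # attach the complement mark to its variable
            i += 1
        out.append(lit)
        i += 1
    return out

def characteristic_set(expr):
    terms = expr.split("v")
    chars = {lit for t in terms for lit in literals(t)}
    for a, b in combinations(sorted(terms), 2):
        chars.add(a + "v" + b)  # every pair of terms, in a fixed order
    return chars

print(sorted(characteristic_set("ABvA'CvA'")))
```

On the example above this reproduces the paper's set C (with A' for Ā), up to ordering of the pairs.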

The different details of standardization are required by the fact that, in the case of trigonometry, the presence of specific functions is necessary for the application of suitable transformations, while with logic, the requirement of maintaining specific functions is one which would be extremely damaging to generality.

Current application of the general model to logical theorems involves only the applicability heuristic to bring the two sides of the theorem into agreement. Although there are many common elements in the scoring and transformation application between the trigonometric and logical implementations, there are also variations. As an example, in trigonometry, after an element is transformed the original forms must be eliminated. In logic, however, even after a portion of the equation is replaced by a transformation, it may remain to be used again. The replaced portion is not required within the equation statement, but may be useful in future transformations. Because of this property, the logical model requires the maintenance of an additional listing of groups which may be included in the problem identity but are not necessary for the equation validity.

As those who use the algebraic proof processes of
logical theorems are aware, it is often possible to be led into application of transformations which give every indication of success by the criteria of appropriateness, only to find that no transformation exists within individual experience to complete the process of equation. To consider this fact, the logical model is equipped with the ability to back up one or more ply in the theorem application process in the event that no applicable theorem can be found.
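The back-up facility amounts to a depth-first search that retreats when no rule applies. The sketch below is our illustration under that reading; the string-rewriting representation and all names are assumptions.

```python
# Hedged sketch of the back-up facility: a depth-first search over rule
# applications that retreats one or more ply whenever no rule applies
# at the current node (assumed representation, not the authors' code).
def prove_with_backup(expr, goal, rules, depth=0, limit=6, seen=None):
    seen = set() if seen is None else seen
    if expr == goal:
        return [expr]                      # success: the path ends here
    if depth == limit or expr in seen:
        return None
    seen.add(expr)
    for pat, repl in rules:
        if pat in expr:
            path = prove_with_backup(expr.replace(pat, repl, 1), goal,
                                     rules, depth + 1, limit, seen)
            if path is not None:
                return [expr] + path
    return None                            # dead end: back up one ply

# The first rule looks promising but leads nowhere; the search backs up
# and succeeds with the second rule.
print(prove_with_backup("XvXY", "XvY", [("XvXY", "X"), ("XY", "Y")]))
```

The `seen` set prevents revisiting a form already abandoned, so backing up always makes progress.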

The logic model worked from a list of identities provided as axioms. These are given in Table II.

With this learned capability, the model proceeded to prove a series of simple but increasingly difficult theorems, including

ID 9* XvXY = X

ID 11* XYvXY = 1

ID 15* XvXY = XvY.

Although the problems are simple ones, the model makes use of its experience in improved solutions, expanding its available theorems for transformation.

TABLE II
AXIOMATIC LOGICAL IDENTITIES

ID 1  Xv0 = X        ID 5  Xv1 = 1
ID 2  X1  = X        ID 6  X0  = 0
ID 3  XvX = X        ID 7  XvX̄ = 1
ID 4  XX  = X        ID 8  XX̄  = 0
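The Table II axioms can be tried out as naive rewrite rules. This is an illustration only, with X' standing in for the overbar: matching the literal symbol X ignores term structure, so it is safe only on expressions built from the single variable X, and the primed patterns must be listed first so that XvX' is not mistaken for XvX.

```python
# The eight axioms of Table II as naive string-rewrite rules (assumed
# representation; X' stands in for the overbar).
AXIOMS = [
    ("XvX'", "1"), ("XX'", "0"),    # ID 7, ID 8 (listed first; see above)
    ("Xv0", "X"), ("X1", "X"),      # ID 1, ID 2
    ("XvX", "X"), ("XX", "X"),      # ID 3, ID 4
    ("Xv1", "1"), ("X0", "0"),      # ID 5, ID 6
]

def simplify(expr):
    """Apply the first matching axiom until no axiom applies."""
    changed = True
    while changed:
        changed = False
        for pat, repl in AXIOMS:
            if pat in expr:
                expr = expr.replace(pat, repl, 1)
                changed = True
                break
    return expr

print(simplify("XvXv0"))   # Xv(Xv0) -> XvX -> X
```

A serious implementation would match against parsed terms, not raw substrings; the sketch only shows how few axioms are needed to start the learning process.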

GENERALITY

While the logical theorem proving model was not carried to the degree of sophistication of the trigonometric theorem proof application, certain conclusions can be drawn relevant to the generality of the learning model. The fundamental concept of the learning model is operable with considerable generality in these two and other considered theorem proving applications. The individual structures of various applications, however, may preclude the direct application to one field of problems of a model that is detailed for another. It should also be recognized that other, perhaps equally valuable, heuristics could be devised for application within the same set of problem solving models. The recognizable generality of such work, then, lies in the overall structure and organization of solution methods rather than in the specific treatment of detailed data and unique recognition methods. Even here, there is a certain amount of generality within certain categories of applications, but if any generality beyond this limited generality between subsets exists, its definition is far from clear.

It seems likely that, at least for a number of years, research in machine learning and intelligence will continue to strive toward any realistic generality, finding increasing commonality in heuristic functions and learning organization while at the same time noting divergence in detailed procedures. Indeed, a malleable structure, like a language, may at least temporarily be the best means of observing the properties and needs of machine learning.

CONCLUSIONS

It is necessary, in any discussion of generality, not to claim false generality and obscure the problems which we face. Only by objectively examining the failings and deceits we practice, through the computer (which will never betray us), can we hope to effectively simulate human learning and intelligent behavior.

Let us evaluate the model described in this paper.

Positive

1) The model does indeed learn to solve increasingly difficult problems, spending less time in trial and error searching as its appropriate experience increases.
2) If provided with two similar problems, it will never solve both in the same way, but will use its experience with valid generality.
3) The model uses the existence of specific objects and combinations of objects to guide decisions. Its lists of such objects are governed by the experience to which it has been exposed.
4) The model categorizes and associates with increasing efficiency during its learning experience.
5) The model uses an increasing but limited memory, remembering only those things which it has found most useful, with consideration of items in memory performed in the order of the item familiarity.
6) On occasion, the model makes a more efficient solution without experience than it does with what is considered to be adequate experience. This occurrence, however, is very rare.
7) The model tends to "take care of the big problems first, then worry about the small ones."
8) The model on occasion cannot solve a problem that would seem to be entirely within its capacity.
9) The basic structure of the model and its heuristics perform satisfactorily in at least two different applications.

Negative

1) The model does not use one type of association in recall. Remembering one related fact does not unearth other semiforgotten facts.
2) The basic heuristics operant within the solution do not structurally improve (although the information upon which they work does) with experience.
3) Although the basic heuristics have certain generality, the basic data-handling processes do not.
4) Facts that are remembered are remembered without fallibility, but facts forgotten are forgotten completely.
5) Heuristics for selection considering sequences of transformations have not yet been documented.
6) Proof of theorems is only a small subset of problem solving ability.

Proof of theorems is clearly only a minuscule part of the totality of problem solving ability. Before machine intelligence can be considered a bread-and-butter part of computer activity, it is necessary that learning and performance extend to solutions of open-ended problems. Such an extension may be brought much closer because of continued
work in theorem proving models. Many of the heuristics considered are valid for both types of problems, i.e., standard form, simplification, etc. Others may be validly used if the structure of the solution is known. In certain instances, generalization as to solution structure can be considered through the use of learned characteristic sets, such as those developed for transformations and theorems within the theorem proving model. Such characteristic sets can predict occurrences and structures within the solution form on the basis of the learning environment as made up of theorems or successful solutions. Higher-order learning mechanisms will also be required for problem association and gross categorization. It is essential that there be clear recognition of the immediate and potential value of the many diverse researches in the simulation of human intellectual abilities.

Models for Railroad Terminals

CHARLES B. SHIELDS

Abstract—This paper describes the recent work which has been done by Battelle Memorial Institute in the development of models for the study of railroad terminals. Specifically, these models are mathematical or logical models programmed for the digital computer and reproduce the operating characteristics of railway classification yards.

The problem of the amount of detail which should be included in the logic of the model is discussed and the factors which influenced the choice of the depth of programming are reviewed.

The philosophy, structure, and general characteristics of one terminal model are discussed as well as some typical applications.

INTRODUCTION

THE MODELS referred to in this paper are what are commonly called mathematical or logical models. These models are computer programs that cause certain inputs to be processed and outputs to be produced in accordance with a mathematical formula or a set of logical rules in a manner simulating the real-life system or process being modeled. Here we are discussing a model of a railway terminal or, to be more specific, a railway classification yard. The type of classification yard may be either a gravity hump yard of considerable complexity or it may be a flat switching yard, many of which are in railroad service today.

In model building, perhaps the most difficulty is experienced in determining to what depth of detail it is advisable to go. In spite of the care taken to design a model that does not go into too much detail, what usually happens is that the model tends to grow in complexity and in detail until, in many instances, the bounds of practicality are exceeded. It is not suggested here that the design should start off with less detail than required, only that the tendency to over-detail the model must be continuously recognized. If the detail is too great, then the model requires not only an extremely long programming time, but also too much running time, and becomes expensive to operate. Furthermore, if there is too much detail, the gathering and preparation of the input data becomes prohibitively lengthy. Lastly, too much detail creates a monstrous amount of detailed output which in many cases is so voluminous that there is not enough manpower available to analyze it.

Manuscript received March 1, 1966. The author is with the Systems Engineering Division, Battelle Memorial Institute, Columbus Laboratories, Columbus, Ohio.

Of course, the converse is true: if there is too little detail the model is inaccurate, and in the extreme case becomes worthless as an analytical tool. It was Battelle's experience that the counsel of the railroad operating personnel was invaluable in obtaining this balance of detail. However, it was found that those operating personnel who were closest to the actual operation involved tended to call for more and more detail. Those personnel who were further away from operational detail and yet who could look at the operation from a system standpoint could give valuable suggestions as to the value of some of the detail.

Actually, with respect to the terminal model, two different models were developed. The first one was developed in great detail, the action being simulated on a car-by-car basis. In this model, complete car records could be developed. The model was excellent for studies where this degree of detail was required, but it required the use of tape overlay, with the resulting slow running time.

The second model, Model II, was developed so as to overcome the slow running time; as a result, the running time for a typical 10-day period was measured in minutes, as compared with the hours required with Model I. Model II does not keep a car-by-car record; rather, a car group becomes the unit that is processed by the model. This grouping of cars is based upon 1) the arrival train, 2) the type of traffic, and 3) the destination. As a train arrives, all cars carrying the same type of traffic and going to the same destination are classified as a group. A group, therefore, may consist of one or more cars. The groups are formed as part of the input-data preparation for the model. Figure 1 displays the modeling concept employed.
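The group-forming step of the input-data preparation can be sketched as follows; the record fields and sample values are our assumptions for illustration, not Battelle's data format.

```python
from collections import defaultdict

# Hedged sketch of Model II's input preparation: cars from an arriving
# train are merged into groups keyed by (arrival train, traffic type,
# destination); a group of one or more cars becomes the processing unit.
def form_groups(cars):
    groups = defaultdict(int)   # group key -> number of cars in the group
    for car in cars:
        key = (car["train"], car["traffic"], car["destination"])
        groups[key] += 1
    return dict(groups)

arrivals = [
    {"train": "A101", "traffic": "coal",  "destination": "Columbus"},
    {"train": "A101", "traffic": "coal",  "destination": "Columbus"},
    {"train": "A101", "traffic": "autos", "destination": "Toledo"},
]
print(form_groups(arrivals))
```

Processing a handful of groups instead of every car is what reduces a 10-day simulation from hours to minutes, at the cost of per-car records.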
