
Dynamic dataflow analysis
T Y CHEN and P C POOLE

Abstract: This paper covers a comparison of the various solutions to some problems of dynamic dataflow analysis and proposes new solutions to some of these problems. It summarizes experiences of implementing and using several dynamic dataflow analysis systems. Some proposals about future directions and the essential features of automated dynamic dataflow analysis systems are also discussed in this paper.

Keywords: dataflow analysis, program testing, software reliability

The technique of dataflow analysis was initially applied to the study of code optimization 1. Fosdick and Osterweil 2 were the first to suggest

the applications of dataflow analysis in program testing. They proposed a method for detecting dataflow anomalies, which intuitively are improper uses of data. In their method, the dataflow analysis is performed only on the source code of programs without actually executing them. Such an approach is said to be static. They developed the first static dataflow analysis system for FORTRAN, called DAVE 3.

The static approach to dataflow analysis has many limitations (e.g. the analysis of individual array elements is complicated; the analysis of pointers is incomplete; the handling of recursive calls for subprograms is very difficult). Huang 4,5 therefore proposed another approach. By using the technique of program instrumentation, software probes to trace dataflow could be inserted into the original source program to generate an instrumented program. Dataflow analysis of the source program could then be performed through the execution of its corresponding instrumented version. Such an approach is said to be dynamic as the analysis is carried out along with the execution of the instrumented program. Recently, many dynamic dataflow analysis systems have been implemented and reported in the literature as being very useful for detecting and locating errors. For example, DFA 6 is a dynamic dataflow analysis system for FORTRAN, AIDA 7 is for PASCAL, DDF 8 is for C and COD 9 is for COBOL. In particular, the dynamic approach can handle the

Department of Computer Science, University of Melbourne, Parkville 3052, Australia

dataflow analysis of pointers, array elements and recursive calls of subprograms more satisfactorily than can be done in the static approach.

The current literature has addressed individual aspects of dynamic dataflow analysis in a somewhat separate and scattered manner. In this paper, the current status of the topic is reviewed and the various solutions to the major problems encountered are compared. The authors summarize their experience in implementing and using DFA, AIDA, DDF and COD. Future directions are also discussed, as well as the important and essential features required for automated dynamic dataflow analysis systems.

During the execution of a program, there are normally three possible actions that can be performed on a variable, namely, define, reference and undefine (denoted by d, r and u respectively). A variable is said to be defined when a value is assigned to it, referenced when its value is read nondestructively and undefined when its value becomes unknown or inaccessible.

Dataflow analysis is concerned with whether data variables are used in a proper way or not, in other words, whether the sequence of actions on a datum is sensible. For example, a sequence consisting of an undefine action followed by a reference action is improper, as it makes no sense to reference an undefined value. An improper usage of data is called a dataflow anomaly and is indicative of questionable coding. Given the above three actions, the following sequences are obviously anomalous:

• ur - an undefine action followed by a reference action (uninitialized variable);

• du - a define action followed by an undefine action (unused definition);

• dd - a define action followed by a define action (redundant definition).

It must be stressed that an anomaly is not necessarily an error and that a programming error does not necessari- ly belong to one of the above three types of anomalies. Intuitively, an anomaly indicates that something is irregular. Thus, the presence of an anomaly only implies that the program may contain errors.

Huang 5 pointed out that dataflow analysis could be carried out dynamically through the technique of program instrumentation and proposed the tracking of state transitions instead of sequences of actions. Corresponding to the above types of actions and


Figure 1. State transition diagram

anomalies, a variable can be assumed to be in one of the following four states:

State U: undefined
State D: defined
State R: referenced
State A: abnormal

His state transition diagram is shown in Figure 1. When a variable is in state A, it means that a dataflow anomaly involving that variable has occurred.
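To make the state transition scheme concrete, the following C sketch (not taken from any of the systems discussed; the names var_state, action and df_transition are illustrative) encodes the four states and a transition function consistent with Figure 1 and the three anomaly types defined above.

    typedef enum { ST_U, ST_D, ST_R, ST_A } var_state;  /* undefined, defined, referenced, abnormal */
    typedef enum { ACT_D, ACT_R, ACT_U } action;        /* define, reference, undefine */

    /* One transition per action; ur, dd and du lead to the abnormal state A, which is absorbing. */
    var_state df_transition(var_state s, action a)
    {
        switch (s) {
        case ST_U: return a == ACT_D ? ST_D : (a == ACT_U ? ST_U : ST_A);  /* r on U is the ur anomaly */
        case ST_D: return a == ACT_R ? ST_R : ST_A;                        /* d or u on D is dd or du */
        case ST_R: return a == ACT_D ? ST_D : (a == ACT_R ? ST_R : ST_U);
        default:   return ST_A;                                            /* state A is never left */
        }
    }

An instrumented probe for each action on a variable would simply apply df_transition to that variable's state variable and report an anomaly on the first entry to the abnormal state.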

Two differences should be noted about the state transition method as compared with the monitoring of sequences of actions. First, no distinction is made between the types of anomalies. This can be remedied simply by using a variety of abnormal states to cater for the various anomalies or by storing the types of the two most recently applied actions so as to be able to derive the anomaly type upon entering the abnormal state. The latter approach has been used in AIDA, DFA and COD, as the types of the two most recently applied actions are also the information needed for identifying the source and location of the anomaly. Second, dataflow analysis for a variable will terminate upon the occurrence of an anomaly. As can be seen from the state transition diagram, a variable, once it enters state A, will remain there forever. Solutions to this problem will be discussed later.

Dynamic dataflow analysis

In this section, some important aspects of dynamic dataflow analysis are reviewed. In addition to comparing various solutions for the major problems of dynamic dataflow analysis, some new approaches are also presented.

State transition diagrams

For efficiency of implementation, Huang 5 proposed the monitoring of state transitions instead of sequences of

actions. Kao and Chen 10 have recently pointed out that although Huang's state transition diagram (Figure 1) was defined without explicit reference to any high level programming language, FORTRAN was in fact the language implicitly referred to when the notion of dataflow analysis was being developed in the context of program testing. They observed that this state transition diagram is inadequate for the dataflow analysis of COBOL programs. They proposed some new actions, states and state transition diagrams for this language. The major conclusion of their study is that for better anomaly detection and location, the characteristics of data types, data structures and operations on data should be taken into consideration when defining the types of actions, the types of states and the state transition diagrams for a particular programming language. Osterweil et al. 11 have also put forward some suggestions for the analysis of COBOL.

Huang's state transition diagram (Figure 1) assumed that once a variable enters the abnormal state, it will remain there. In other words, there is no attempt to continue further analysis for an anomalous variable. The argument is that there is insufficient information available to know in which state the anomalous variable should restart after it has entered the abnormal state. DFA, AIDA and COD have all adopted this principle.

In fact, Huang proposed two methods of continuing the dataflow analysis of variables even if anomalies have occurred. In his study of how to apply dataflow analysis to a portion of a program, he solved the problem as to what should be the most appropriate state value to resume the analysis. After an exhaustive examination of all possibilities, he reached the conclusion that in the absence of any other information state R is the most desirable value with which to restart. A detailed discussion can be found elsewhere 5. Thus, an obvious strategy is to reset the state of a variable to R upon entering the abnormal state A (referred to as Huang's first method). The other strategy uses a new state transition diagram (referred to as Huang's second method). In view of the absence of information about the types of anomalies in the original state transition diagram as shown in Figure 1, he proposed replacing the unique abnormal state A by three anomalous states, namely the define-define state DD, the define-undefine state DU and the undefine-reference state UR. This gives rise to a new state transition diagram as shown in Figure 2, which permits the continuation of dataflow analysis even if anomalies occur. Obviously, a similar extension of Huang's second method can easily be applied to the state transition diagrams used in DFA, AIDA and COD.

DDF, which analyzes C language programs, attempts to continue the analysis even if an anomaly has occurred. It assumes that actions r, d and u will set the state variables to R, D and U respectively. By recording the type of the most recent action, DDF is then able to detect the anomaly. In our view, this methodology represents a combination of the above two proposals made by Huang, as illustrated by the following


Figure 2. State transition diagram

examples. Let us look at the case where a variable is in state U and a sequence of actions urr is applied to it. DDF will detect a ur anomaly. Huang's first method also detects the same ur anomaly while two ur anomalies will be reported by Huang's second method. Now, let us consider another case where the variable is in state U and a sequence of actions ddd is applied to it. DDF will report two dd anomalies. Huang's second method also detects the same two anomalies; however, only the dd anomaly due to the first two define actions in ddd would be reported by Huang's first method. Since urr contains one ur anomaly and ddd contains two dd anomalies, it would seem that DDF's method is better than either of Huang's two methods with reference to the preservation of the information associated with the sequence of actions.

State variables

As explained in the previous section, state transitions instead of sequences of actions can be monitored to detect dataflow anomalies. A state variable is therefore required for each data object under analysis. In the early systems, explicit state variables were inserted into the source programs. Basically, the technique used was to build up a parallel data structure of state variables for the objects to be analysed. However, this approach is not satisfactory for data structures such as linked lists, because of the unacceptably heavy overheads incurred when the linked list of state variables has to be traversed in exactly the same order as the traversal of its corresponding linked list of data objects. For such data structures, Chan and Chen 7 proposed a method involving the use of embedded state variables. However, their proposal still suffers from the disadvantage of being unable to analyse programs with external procedures whose sources are not available for program instrumentation.

Recently, Price and Poole 8 introduced the notion of an implicit state variable, which provides a better solution

to the problem. Instead of explicitly declaring state variables, the address and size of the data object are used as keys for defining the state variable without assigning it an explicit identifier. In C, for example, the subroutine call def(&x, sizeof(x)) is the instrumentation corresponding to the case of having a define action on the data object x, where & is the address-of operator and sizeof is the size-of operator. This avoids all the previously mentioned problems caused by using explicit state variables.

Until recently, software probes to update the state variables were inserted as separate statements into the source programs (e.g. in DFA, COD and AIDA). Obviously, it is essential to know the exact sequence of actions in the original source program for the correct updating of the state variables. The analysis of this sequence is a static one. However, there are situations where the sequence is either undefined or execution dependent. For example, the C programming language contains operators that evaluate operands in an unspecified order. In such situations, the earlier method needs to decompose the original statements to ensure the correctness of the instrumentation. Such a decomposition is not only language dependent but may even be compiler dependent. This makes the system much less portable. To remedy this situation, a new method known as functional instrumentation was adopted in DDF. The basic idea is to transform the original source statements into equivalent ones with software probes for state transitions embedded at appropriate positions in the reconstructed statements so as to be able to record the correct sequence of actions as closely as possible. This method is not only conceptually simple but is also easy to implement. As an illustration, suppose x is an integer variable and y is a character variable; the instrumentation for the C statement:

x=x+y

is:

(x = *(int *)ref(&x, sizeof(x)) + *(char *)ref(&y, sizeof(y)), def(&x, sizeof(x)))

where the expression *(int*) ref(&x, sizeof(x)) records x as being referenced and accesses x as an integer datum.
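A minimal sketch of how such probes might be realized is given below, assuming a simple linear table and a hypothetical helper named lookup; it only illustrates the idea of implicit state variables and functional instrumentation and is not the actual DDF implementation. The probe ref returns the address it is given, so that the instrumented expression above can cast and dereference it in place.

    #include <stddef.h>

    typedef enum { ST_UNDEF, ST_DEF, ST_REF } obj_state;       /* most recent action on the object */

    typedef struct { void *addr; size_t size; obj_state st; } state_entry;

    static state_entry table[4096];   /* implicit state variables keyed by (address, size) */
    static int used;

    static state_entry *lookup(void *addr, size_t size)
    {
        int i;
        for (i = 0; i < used; i++)
            if (table[i].addr == addr && table[i].size == size)
                return &table[i];
        table[used].addr = addr;
        table[used].size = size;
        table[used].st = ST_UNDEF;                              /* unseen objects start as undefined */
        return &table[used++];
    }

    void def(void *addr, size_t size)                           /* probe for a define action */
    {
        state_entry *e = lookup(addr, size);
        if (e->st == ST_DEF) { /* report a dd anomaly here */ }
        e->st = ST_DEF;
    }

    void *ref(void *addr, size_t size)                          /* probe for a reference action */
    {
        state_entry *e = lookup(addr, size);
        if (e->st == ST_UNDEF) { /* report a ur anomaly here */ }
        e->st = ST_REF;
        return addr;                                            /* returned so the object can be read in place */
    }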

Dataflow anomalies

Although anomalies may be detected during dataflow analysis, the final objective is to uncover errors. However, there are many situations in which an anomaly is not an error. Thus, to locate errors among anomalies may be a time-consuming task. However, even if the detected anomalies are not errors, they are usually indications of questionable coding or bad programming practices, for example, the lack of initialization or the unnecessary initialization of variables 6. In this sense, the detection of such anomalies can be used to reveal impurities in the software, the removal of which will not only improve its quality but may also reduce


maintenance costs. If one adopts this view, a dataflow analyser becomes more than just a program testing tool.

To handle the problem that an anomaly may not be an error and still be able to locate errors automatically, Chan 12 introduced the classification of fatal and nonfatal anomalies. An anomaly is defined as fatal if it is definitely an error while a nonfatal anomaly is not necessarily so. Obviously, for a simple scalar variable, a ur anomaly is always fatal. He also pointed out some conditions under which a dd or du anomaly is a fatal one. Given the notion of fatal and nonfatal anomalies, the process of locating errors amongst anomalies can be partially automated 6,7.

Two levels of dataflow analysis can be provided if one uses the notion of fatal and nonfatal anomalies. One level will only report the fatal ones while the other will report all anomalies. Future dataflow analysers should include such a facility to satisfy the demands of different users (or even the same user at various stages of the software development life cycle).

During dynamic dataflow analysis, it is common to have a situation where an anomaly is detected for an element of an array (or other data structure) and then similar anomalies are subsequently found for other elements of the same structure. It would be somewhat irritating for a user to examine an anomaly report and discover that all the anomalies were caused by the same problem. An obvious solution is just to report the fatal anomalies and suppress the output of nonfatal ones unless they are requested explicitly. It is also most unsatisfactory to have the anomalies treated as unrelated in the anomaly report if they are all caused by the same code sequence in the program. A possible solution for this problem is proposed later in which the dataflow analysis of arrays is discussed.

Obviously, most of the false alarms are likely to be dd or du anomalies. The nature of these anomalies has been investigated by Price and Poole 8 who refer to them as dead definitions since the first action d is virtually useless for dynamic dataflow analysis. They observed that the reason for having so many false alarms due to dead definitions is that the conventional criterion for detecting such anomalies is memory-based, which means that dead definitions for different objects are regarded as distinct, even if all of them are due to the same code sequence. DFA, AIDA and COD have also used the memory-based approach. A more appropriate approach for the analysis of dead definitions, however, is a statement-based one, which means that a definition by a particular piece of code is flagged as a dead action only if it will never be referenced. Thus, there exists a dead-on-all-paths criterion which means that a definition in a code sequence is dead if it will not be used on any executable path. This criterion will reduce considerably the number of false alarms due to dd and du anomalies. Unfortunately, the direct implementation of a dead-on-all-paths criterion is not feasible in dynamic dataflow analysis, as this approach has no knowledge about those paths which have not yet

been executed. For implementation purposes, a definition is then regarded as alive if it is used on some executed paths, and a dd or du anomaly is regarded as a false alarm if its first definition is alive. Obviously such an implementation is still able to eliminate many false alarms due to dd and du anomalies.
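A sketch of this implementation idea follows; it assumes that each definition statement in the source is numbered (a site) and that the state variable of every object also records the site of its most recent definition, and the probe names are hypothetical.

    #define MAX_SITES 10000

    /* set to 1 once a definition made at this statement has been referenced on some
       executed path (statement-based rather than memory-based) */
    static int site_alive[MAX_SITES];

    /* called from the reference probe with the site of the definition currently held by the object */
    void mark_definition_alive(int def_site)
    {
        if (def_site >= 0 && def_site < MAX_SITES)
            site_alive[def_site] = 1;
    }

    /* called when a dd or du anomaly is detected; the anomaly is treated as a false alarm
       if the statement that performed the first definition is already known to be alive */
    int dead_definition_false_alarm(int first_def_site)
    {
        return first_def_site >= 0 && first_def_site < MAX_SITES && site_alive[first_def_site];
    }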

False alarms due to ur anomalies are less frequent and are usually associated with structured data such as records in PASCAL or group variables in COBOL. Normally, if an action is applied to the whole data structure, the same action would be assumed to be applied to all of its components. For example, in PASCAL, if a record X is assigned the value of another record Y, there is obviously an action r on Y and a similar action r on all the fields of Y. However, this argument must be used carefully as it may give rise to false alarms for the ur anomaly in certain situations. Under some circumstances, it would be quite natural and usual to have some but not all fields of Y undefined. If the above principle is followed, there will be false alarms for ur anomalies for the undefined fields of Y. Obviously, the ur anomalies on the undefined fields of Y should be reported if all fields of Y are undefined.

Segmentation of dataflow analysis

Normally, the initial state of a variable is U 5. For COBOL, the initial states of variables are determined by how they are declared in the Data Division 10. Where a portion of a program is to be analyzed, in the absence of any other information, the most desirable state value to start the dataflow analysis 5 is R.

Huang has investigated the problem of how to perform dataflow analysis across the boundary of two subprograms. The basic idea is to use queues for the transfer of state values between the formal and actual parameters so that the dataflow analysis can be carried over the boundary of the subprograms. In his method, he defines a subprogram called que(func, var), where the first parameter can be in or out and the second parameter is a state variable name. Before the subprogram is executed, calls are made on que to cause the states of the actual parameters to be placed on a stack. Immediately after the subprogram has been entered, the states are retrieved from the stack and assigned to the states of the corresponding formal parameters. Prior to exit from the subprogram, the updated states are again preserved on the stack and retrieved again after control is returned to the calling program.

Chan and Chen 13 have adopted a totally different approach to solve the problem of dataflow analysis across a subprogram boundary. The basic idea of their method (hereafter called the CH2 method) is to treat different subprograms as totally independent segments. The dataflow analyses of different subprograms are therefore totally independent of each other and there is no need to set up any mechanism for the transfer of the state values between the formal and actual parameters


Figure 3. State transition diagram

of the subprogram. Thus, their method is in keeping with the methodology of modular programming in the development of large software systems. In this method, a variable will be in one of the following seven states:

State U: undefined
State D: defined and not referenced
State R: defined and referenced
State A: abnormal
State DS: defined and should be referenced (can be defined later)
State DR: defined for reference only
State RO: for reference only

The state transition diagram is shown in Figure 3. Parameters are classified according to the way they

are used as follows:

• input parameters which are left unaltered during the execution of the subprogram,

• input parameters which also act as output parameters,

• input parameters which also act as working variables during the execution of the subprogram,

• output parameters.

Suppose Fi, Fio, Fiw and Fo are used to denote the above four types of formal parameters respectively and the corresponding types of the actual parameters are Pi, Pio, Piw and Po. The basis of the method is that it can be assumed that the effect of the call statement in the calling subprogram is to apply the actions r, rd, ru and d on actual parameters of types Pi, Pio, Piw and Po respectively. The states of the actual parameters are therefore updated immediately after exit from the called subprogram, thereby revealing any anomalies in the normal manner. In the called subprogram, the initial states of formal parameters of types Fi, Fio, Fiw

and Fo are set by instrumentation code to DR, DS, DS and U respectively. Instrumented probes are also inserted just before the return to the calling subprogram to check whether the formal parameters are in the appropriate states or not in order to increase the anomaly detection and location capabilities. DR is now an anomalous state for a formal parameter of type Fi since by definition it will be referenced in the called subprogram. Similarly state DS for parameters of types Fio and Fiw is anomalous since the called subprogram is expected to define these parameters. Also parameters of types Fio and Fiw will be referenced in the calling subprogram as soon as exit occurs from the called routine. Therefore, they can be tested to determine whether an r action will produce an anomaly.
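A sketch of the entry and exit probes implied by this description is shown below; the state encoding, the helpers set_state, get_state and report, and the probe names are assumptions for illustration, and the actions r, rd, ru and d applied to the actual parameters at the call site would be instrumented in the usual way.

    #include <stddef.h>

    typedef enum { F_I, F_IO, F_IW, F_O } param_kind;                /* Fi, Fio, Fiw, Fo */
    typedef enum { S_U, S_D, S_R, S_A, S_DS, S_DR, S_RO } ch2_state; /* states of Figure 3 */

    extern void set_state(void *addr, size_t size, ch2_state s);     /* assumed runtime helpers */
    extern ch2_state get_state(void *addr, size_t size);
    extern void report(const char *message);

    /* inserted immediately after entry to the called subprogram, once per formal parameter */
    void ch2_enter(void *addr, size_t size, param_kind k)
    {
        switch (k) {
        case F_I:  set_state(addr, size, S_DR); break;  /* Fi starts in state DR */
        case F_IO:
        case F_IW: set_state(addr, size, S_DS); break;  /* Fio and Fiw start in state DS */
        case F_O:  set_state(addr, size, S_U);  break;  /* Fo starts undefined */
        }
    }

    /* inserted just before each return to the calling subprogram */
    void ch2_exit(void *addr, size_t size, param_kind k)
    {
        ch2_state s = get_state(addr, size);
        if (k == F_I && s == S_DR)
            report("input parameter was never referenced");        /* DR at exit is anomalous for Fi */
        if ((k == F_IO || k == F_IW) && s == S_DS)
            report("parameter expected to be defined was not");    /* DS at exit is anomalous for Fio, Fiw */
    }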

It should be obvious from the above discussion that where there are no global data objects (excluding parameters), the dataflow analyses of two subprograms are totally independent of each other. An immediate consequence is that the analysis of any one subprogram can be suppressed without affecting the analysis of the others.

In the above method, there are no queues and there are better anomaly detection and location capabilities under most circumstances; the analysis of subprograms can be suppressed independently, that is, the method can still be applied even if some subprograms are not available for instrumentation; also, it tries to detect as many anomalies as possible in one run. Examples showing improved anomaly detection and location capabilities as well as the other advantages can be found elsewhere 13. However, this method does depend on having access to the classification of parameters and sometimes this is only available at runtime.

DFA, AIDA and COD all provide users with two levels of analysis for subprograms. First, there is an option of using either Huang's method or the CH2 method. Second, when the latter method has been selected there is an option to suppress the analysis of particular subprograms. From the authors' experience in using these tools, this type of facility has been found to be very useful and is one that should be provided in any dynamic dataflow analysis system.

Arrays

One of the major advantages of dynamic dataflow analysis over the static approach is its ability to analyse the dataflow of each individual array element. For static analysis, since it is not possible to know in advance the exact array element references, the usual practice is just to treat the whole array as a single entity. Obviously, such an approach is not entirely satisfactory. For dynamic analysis, since the array element references are computed during execution, dataflow analysis of array elements can be performed individually. In this aspect, the dynamic approach is much better than the static one. However, there is a side effect. Since array elements are usually processed in a very similar manner, when an array element has a


dataflow anomaly, the same type of anomaly will frequently be found for many other elements of the same array. For large arrays, this may be very annoying especially when these anomalies are caused by the same problem.

From the authors' experience in using dataflow analysis systems such as DFA and AIDA, it is concluded that it would be preferable to have a preliminary analysis of the detected anomalies and then reformat the anomaly messages in a more concise form before outputting them to the user. One obvious approach is that for the same array, when the anomalies for the array elements are of the same type and have improper combinations of actions due to the same code, it is sufficient just to report the first occurrence of each such anomaly. This will reduce considerably the number of anomaly messages. Obviously, this approach can be extended to other data structures whose elements are usually processed in a similar manner.
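A straightforward way to implement this filtering, sketched below, is to remember which combinations of array (or structure), anomaly type and responsible statement have already been reported and to suppress repeats; the table layout and names are illustrative assumptions.

    typedef struct { int array_id; int anomaly_type; int statement; } report_key;

    static report_key reported[1000];
    static int nreported;

    /* returns 1 only for the first occurrence of a given (array, anomaly type, statement) combination */
    int should_report(int array_id, int anomaly_type, int statement)
    {
        int i;
        for (i = 0; i < nreported; i++)
            if (reported[i].array_id == array_id &&
                reported[i].anomaly_type == anomaly_type &&
                reported[i].statement == statement)
                return 0;                           /* same anomaly already reported; suppress */
        if (nreported < 1000) {
            reported[nreported].array_id = array_id;
            reported[nreported].anomaly_type = anomaly_type;
            reported[nreported].statement = statement;
            nreported++;
        }
        return 1;
    }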

Pointer variables

Another major advantage of the dynamic over the static approach is its ability to perform better dataflow analysis for pointer variables. Recently, Wilson and Osterweil 14 developed a static dataflow analysis tool for the C programming language, known as OMEGA. They pointed out the difficulties encountered in performing static dataflow analysis for pointer variables. In OMEGA, some pointer assignments are ignored and not analysed. However, a very comprehensive and thorough dataflow analysis of pointer variables is supported by DDF.

As an example, let us consider the situation where a pointer variable is assigned to point to a data object. Suppose the data object now ceases to exist and the pointer variable is then referenced. Clearly, there is an anomaly associated with the pointer variable. Obviously, detection of such anomalies is extremely difficult in static dataflow analysis. However, the problem can easily be solved in the dynamic approach. The basic idea is simply to assign to every data object a unique identifier, say an integer. When a pointer variable is assigned to point to a data object, the state variable of that pointer can also be given the same identifier as the data object. Identifiers of active data objects are stored in a table. Before a pointer variable is accessed, its identifier is checked against this table. If its identifier is not in the table, an undefine action is then applied to the pointer variable, whose state variable is subsequently set to have the same identifier as that of the null data object.
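A sketch of this bookkeeping, under the assumption of a simple array of flags indexed by identifier and hypothetical probe names, might look as follows.

    #define MAX_OBJECTS 10000
    #define NULL_OBJECT 0                      /* identifier reserved for the null data object */

    static int active[MAX_OBJECTS];            /* nonzero while the object with that identifier exists */
    static int next_id = 1;

    int object_created(void)                   /* called when a data object comes into existence */
    {
        active[next_id] = 1;
        return next_id++;
    }

    void object_destroyed(int id)              /* called when the data object ceases to exist */
    {
        active[id] = 0;
    }

    /* the state variable of a pointer records the identifier of the object it points to */
    void pointer_assigned(int *ptr_id, int target_id)
    {
        *ptr_id = target_id;
    }

    /* called before each access through the pointer */
    void pointer_accessed(int *ptr_id)
    {
        if (*ptr_id == NULL_OBJECT || !active[*ptr_id]) {
            /* the object no longer exists: apply an undefine action (i.e. flag a dangling
               pointer) and reset the state variable to the identifier of the null data object */
            *ptr_id = NULL_OBJECT;
        }
    }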

In addition to being able to detect anomalies due to dangling pointers, DDF also supports the detection of anomalies where a pointer lies outside its intended region of indirection. Although the scope of the intended region can only be loosely defined, there are some clear examples of this type of pointer range anomaly. Consider the following C program:

substring(index, count)
int index, count;
{
    extern char buf[101];
    register char *p, *q;
    p = &buf[index];
    q = &buf[index + count];
    while (p < q)
        putchar(*p), ++p;
}

It is obvious from the program that the pointer variable p is intended to point to the elements of the array buf. However, if the call substring(97, 5) is executed, p will contain the address of buf[97] on the first iteration but the address of buf[101] on the last iteration. This is regarded as a pointer range anomaly as p points outside the array buf, which is obviously the intended region of p in this example.

The pointer range anomaly is intuitively modelled on the array bounds error. However, it should be noted that pointer variables may be used very freely to point to even totally unrelated locations on different occasions, while subscript expressions are bounded by their lower and upper bounds. The notion of base was introduced by Price 15 to formalize the analysis of these pointer range anomalies. The base of an address expression is defined as that component of the expression which defines the normally intended region pointed to by the expression. The base of a pointer variable is defined as the base of the address expression which has most recently applied a define action on the pointer variable.

The bases are classified into six types:

(1) ZERO base is for the null pointer,
(2) BLANK base is for the indeterminate pointer,
(3) FIXED base is for the address of a simple scalar variable,
(4) DIRECT base is for the address of an aggregate data structure,
(5) POINTER base is for the address of a pointer variable,
(6) INDIRECT base is similar to POINTER base but is only for pointers accessed through indirection.

Every base type is associated with a set of permissible addresses. For example, ZERO base has the empty set while DIRECT base has the set of addresses of all elements of its associated aggregate data structure, e.g. if s is an aggregate data structure and x is a member of s, (char *)&s.x + 1 has the base DIRECT s. Details about the handling of pointer range anomalies can be found elsewhere 8,15.
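As an illustration of the permissible-address idea, the sketch below checks an address against a DIRECT base, i.e. against the extent of the associated aggregate data structure; the structure and names are assumptions, and the remaining base types would be handled analogously. In the substring example above, p would carry the base DIRECT buf, and the address of buf[101] would fall outside the permissible set, which is exactly the pointer range anomaly described.

    #include <stddef.h>

    typedef enum { B_ZERO, B_BLANK, B_FIXED, B_DIRECT, B_POINTER, B_INDIRECT } base_kind;

    typedef struct {
        base_kind   kind;
        const char *start;   /* for a DIRECT base: start of the aggregate data structure */
        size_t      size;    /* for a DIRECT base: size of the aggregate in bytes */
    } base_info;

    /* returns 1 if dereferencing addr is within the permissible address set of the base */
    int address_permissible(const base_info *base, const char *addr)
    {
        switch (base->kind) {
        case B_ZERO:   return 0;   /* the null pointer has an empty permissible set */
        case B_DIRECT: return addr >= base->start && addr < base->start + base->size;
        default:       return 1;   /* other base types are not modelled in this sketch */
        }
    }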

Completeness of the detection of dataflow anomalies

Another criticism of dynamic analysis concerns the completeness with which dataflow anomalies are detected. Obviously, which anomalies are detected dynamically is determined by the control path followed during execution. However, a program usually has a very large number of, or even infinitely many, possible


execution paths. Therefore, it is very useful to know which possible execution paths can be ignored without leaving some anomalies undetected. Huang 5 has found a very simple answer to this problem. For the detection of all ur, dd and du anomalies, he has proved that it is only necessary to execute all possible execution paths that iterate a loop zero or two times. His result is very useful as it provides a mechanism for partitioning the set of all possible execution paths into domains and for selecting sufficiently powerful representatives from each domain.

In the COD system for dataflow analysis of COBOL programs, a path report containing information about the paths executed and the coverage of executed statements is produced along with an anomaly report. From the authors' experience with this system, they believe that this kind of path report is very helpful in selecting test data for further analysis of the programs, especially in the light of Huang's result. It should be available in any good dynamic dataflow analysis system.

Overheads

All dynamic dataflow analysis systems developed to date have separate instrumentation and compilation phases, that is, the source program is instrumented first and then the instrumented program is compiled for execution. Since the instrumentation phase needs to perform a great deal of syntactic analysis, there are overheads due to the repetition of this analysis during the compilation of the instrumented program. The main reason for having separate instrumentation and compilation is ease of implementation and portability. Obviously, these kinds of overheads could be removed if the instrumentation is incorporated into the compilation phase.

Another more serious overhead is due to the possible repetition of dataflow analysis on different runs or even on the same run. These overheads can be partially removed by using the CH2 method to suppress the analysis of any subprogram that does not contain any global data objects (excluding parameters). Once a subprogram has been thoroughly tested, further dataflow analysis could be avoided in subsequent runs. DFA, AIDA and COD all provide the user with the option to suppress the analysis of any subprogram.

In this paper, a further method is proposed for reducing the overheads involved in dynamic dataflow analysis. The idea comes from Forman's 16 recent study on the formalization of the detection of dataflow anomalies using regular expressions. In the method explained in this paper the executable part of the program is divided into instrumentation blocks (also known as basic blocks). Such a block is simply a sequence of consecutive actions which may be entered only at the first action and, once entered, is executed in sequence without stopping or branching except at the last action. Actually, AIDA performs this kind of decomposition for the purpose of locating an anomaly more exactly.

For any scalar variable x in a block, let axsxbx denote the sequence of actions applied to x within this block, where ax denotes the first action, bx denotes the last action and sx denotes the sequence of actions between ax and bx. A simple static analysis of axsxbx can determine whether this sequence of actions contains an anomaly or not. The rationale behind this method is that since the sequence axsxbx is fixed, its analysis need be performed only once.

Let us use f to denote the state transition function and x_state to denote the state variable of x. For Huang's state transition diagram (Figure 1), the program instrumentation for the dynamic dataflow analysis of variable x within the block is simply:

Case (1): axsxbx contains an anomaly

x_state = f(ax, x_state)
if x_state ≠ A then x_state = A /* report the anomaly of axsxbx */

Case (2): axsxbx does not contain an anomaly

x_state = f(ax, x_state)
if x_state ≠ A then x_state = S

where S is D, R or U, if bx is d, r or u respectively (note that the action of bx can be determined statically); the message enclosed by /* and */ is an explanatory note. When ax is the unique action within the block, the final instrumentation statement in case (2) can be omitted.

Let us now consider the correctness of the above instrumentation. Here we assume that the anomaly report routine will only be called on the first entry to state A, that is, analysis of a variable will terminate upon detection of an anomaly. For simplicity, the software probes for reporting and locating anomalies are not included. The first instrumentation statement in case (1) detects whether ax and the preceding action will induce an anomaly or not, thereby ensuring that the first time an anomaly is detected it is properly reported. The second instrumentation statement merely sets x_state to A if it is not already in that state because of the preceding call on f. For case (2), there are only two possibilities, namely, whether ax will give rise to an anomaly or not. If an anomaly does occur, then f(ax, x_state) will assign A to x_state which then will remain in this state throughout the remainder of the computation. Otherwise, any anomaly can only be due to bx and the next action. Now, the problem is which value should be assigned to x_state immediately after the application of bx so that the dataflow analysis of x can be continued in the next block. Consider first the situation where bx is d. Since there is no anomaly in the block, the state of x when bx is applied cannot be D and can therefore only be U or R. Hence, it follows from Figure 1 that x_state is D immediately after the application of bx irrespective of whether it was U or R before the application of bx. Similarly, it can be shown that x_state is R or U immediately after the application of bx if bx is r or u respectively. Note that the instrumentation statement, if x_state ≠ A then x_state =


S, in case (2) is only needed for bx when the block contains more than one action. Hence it can be omitted when ax is the unique action within the block.
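As a worked example of the scheme (using the notation above rather than the probes of any particular system), consider an instrumentation block consisting of the two C statements x = y; x = z;. The block applies the sequence dd to x, which contains an anomaly, so case (1) applies to x; it applies a single r to each of y and z, so case (2) applies to them with the final statement omitted. Since only the first and last actions per variable matter, the probes may be placed at the head of the block:

    y_state = f(r, y_state)
    z_state = f(r, z_state)
    x_state = f(d, x_state)
    if x_state ≠ A then x_state = A /* report the dd anomaly of x within the block */
    x = y; x = z;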

For dataflow analysis systems which use the CH2 method for subprograms, it is obvious that the same instrumentation as given above will apply except for the case where x is of type Fi and axsxbx does not contain an anomaly. In this situation, the instrumentation is:

x_state = f(ax, x_state)
if bx = r
then { if x_state ≠ A then x_state = RO }
else x_state = A

Again, the second instrumentation statement may be omitted if ax is the unique action on x within the block.

In the section entitled 'State transition diagrams' it was pointed out that the method used in DDF for the continuation of dataflow analysis after the occurrence of an anomaly is by far the best solution with respect to the preservation of information associated with the sequence of actions. Furthermore, this method fits well with the current proposal since it assumes that actions d, r and u will set x_state to D, R and U respectively irrespective of the previous value of x_state. For the state transition function used by DDF, the instrumentation is simply:

x_state = f(ax, x_state)
x_state = S
/* above not instrumented if the block contains a single action */
x_state = A
/* above not instrumented if no anomalies within the block */

where S is D, R or U if bx is d, r or u respectively.

Conclusion

The major disadvantages of static dataflow analysis are difficulties in analysing individual array elements, limitations on the analysis of pointers, problems in handling recursive subprogram calls and the impossibility of detecting some nonexecutable paths. On the other hand, dynamic dataflow analysis suffers from the disadvantage that the analysis is input-data dependent. On the one hand, the analysis may only be partial since some paths in the program may never be traversed. On the other hand, there may be unnecessary repetition of analysis since different test data may cause the same path traversals.

However, it should be noted that the static and the dynamic approaches are not competing with each other. Rather they are complementary as can be seen by comparing their disadvantages. In the dynamic approach, individual array elements can be analysed easily, pointers can be analysed more thoroughly, analysis of recursive calls of subprograms is no longer a problem and all paths being analysed are obviously executable. In the static approach, analysis is not partial in the sense that the detection of anomalies is

not data dependent. Further, there is no duplication of analysis along the same execution paths.

From the authors' experience in using DFA, we are led to the conclusion that the technique of dynamic dataflow analysis is particularly powerful in revealing four types of errors in FORTRAN, namely, misspelling of identifiers, uninitialized variables, incorrect parameter references and misplacement of statements. These errors have long been held to be the most common and most obscure ones in FORTRAN. Experience with AIDA shows that the technique is also powerful enough to detect additional errors in PASCAL due to misuse of parameters (e.g. a variable parameter is declared as a value parameter) and the unintentional destruction of defined values. DDF has demonstrated the power of dynamic dataflow analysis for analysing pointers. Dynamic dataflow analysis is a very useful testing technique for those programming languages in which pointers are used extensively.

Although some of the anomalies detected by dynamic analysis are not errors, they are usually indications of bad programming practice, such as presuming the value of an uninitialized integer variable to be zero, using redundant code for unnecessary initialization of variables, etc 6. The detection of such anomalies will lead the programmer to inspect the code more closely and eventually will help to remove impurities thereby producing better quality and more easily maintainable software.

From the authors' experience in using DFA, AIDA, DDF and COD, it has been ascertained what features a good practical dynamic dataflow analysis system should possess. These have been noted and discussed in the previous section. Furthermore, it has been observed that there are several areas which could be improved to make dynamic dataflow analysis systems more appealing as practical program testing and development tools. These are as follows:

First, overheads need to be reduced considerably. The method described earlier is a possible starting point. Further improvements could be made through the analysis of the predecessor and successor relationship between various instrumentation blocks.

Second, users will not be satisfied with a dynamic system which can only report and locate anomalies blindly without exhibiting any intelligence, especially since many anomalies are due to the same code in the program. A more intelligent anomaly reporter must be developed for the production of more concise and comprehensive reports. Furthermore, the system should be able to refer to the cumulative results of previous analyses for later analysis (e.g. see the dead-on-all-paths criterion for the identification of false alarms of dd and du anomalies as described in the section on 'dataflow anomalies'). It is also very desirable to have additional reports about the coverage of executed statements, branches and paths.

Third, the authors have pointed out earlier that the dynamic and the static approaches are complementary. They believe certain types of static analysis should be


incorporated into the dynamic system to supplement the intrinsic limitations of dynamic analysis so as to make the system a more practical tool. Two cases have already been mentioned in this paper which support this suggestion, namely, a simple static analysis for the reduction of overheads and a statement-based approach, which originates from static analysis, towards the identification of false alarms for dd and du anomalies.

Finally, stronger criteria for classifying anomalies into fatal and nonfatal anomalies must be developed as this should help in automating more efficient and successful location of errors amongst anomalies. Current criteria are too weak and leave many errors still classified as nonfatal anomalies. Good theoretical results for this aspect will make dataflow analysis a more appealing and thorough testing tool.

A project is currently underway to enhance DDF so that it incorporates many of the above suggestions.

Acknowledgements

The authors are very grateful to David Price for his invaluable discussions and to CSIRO for a research grant to support this project. Many thanks are due to Isaac Balbin for his help in preparing the diagrams.

References

1 Allen, F E 'A basis for program optimization' Proc. IFIP (1971) pp 385-390

2 Fosdick, L D and Osterweil, L J 'Data flow analysis in software reliability' ACM Comput. Surv. Vol 8 (1976) pp 305-330

3 Osterweil, L J and Fosdick, L D 'DAVE - a validation error detection and documentation system for FORTRAN programs' Software - Pract. & Exper. Vol 6 (1976) pp 473-486

4 Huang, J C 'Program instrumentation and software testing' Computer Vol 11 (1978) pp 25-32

5 Huang, J C 'Detection of data flow anomaly through program instrumentation' IEEE Trans. Software Eng. Vol 5 (1979) pp 226-236

6 Chen, T Y and Leung, H 'The use of data flow analysis as an aid to developing FORTRAN programs' Proc. 9th Australian Comput. Conf. (1982) pp 240-249

7 Chan, F T and Chen, T Y 'AIDA - a dynamic data flow anomaly detection system for Pascal programs' Software - Pract. & Exper. Vol 17 (1987) pp 227-239

8 Price, D A and Poole, P C 'Dynamic data flow analysis - a tool for reliability' Proc. 1st Australian Software Eng. Conf. (1986) pp 97-100

9 Chen, T Y, Kao, H, Luk, M S and Ying, W C 'COD - a dynamic data flow analysis system for COBOL' Inf. & Management Vol 12 (1987) pp 65-72

10 Kao, H and Chen, T Y 'Data flow analysis for COBOL' SIGPLAN Notices Vol 19 (1984) pp 18-21

11 Osterweil, L J, Fosdick, L D and Taylor, R N 'Error and anomaly diagnosis through data flow analysis' in Computer Program Testing, Chandrasekaran, B and Radicchi, S (Eds), North-Holland (1981) pp 35-63

12 Chan, F T 'Testing programs with data flow analysis' Proc. SEARCC 80 (1980) pp 285-293

13 Chan, F T and Chen, T Y 'On data flow analysis across subroutine boundary' Proc. Int. Comput. Symp. Vol 1 (1980) pp 170-176

14 Wilson, C and Osterweil, L J 'Omega - a data flow analysis tool for the C programming language' Proc. COMPSAC '82 (1982) pp 9-18

15 Price, D A 'Program instrumentation for the detection of software anomalies' M.Sc. Thesis, University of Melbourne (1985)

16 Forman, I R 'An algebra for data flow anomaly detection' Proc. 7th Int. Conf. Software Eng. (1984) pp 278-286

