
Static Analysis of Computer Programs


1 Introduction

The problem of slicing requires us to compute data dependencies and control dependencies. Finding data dependencies requires us to solve the problem of reaching definitions; control dependencies require finding dominance frontiers. The problem of aliasing has to be solved for computing correct data dependencies. Each of these individual problems can be viewed as an instance of the more general problem of program analysis.

2 Program Analysis

Program analysis is the method of statically computing properties of a program. Program analysis is useful for performing optimizations. Many interesting properties can be queried about programs [?]. In particular, for slicing we are interested in the following.

1. What definitions of a variable can reach a particular program point

2. What objects can a reference variable point to

3. What types can a receiver variable point to

4. Can the pointer p point to null

5. Which nodes dominate a given node (dominator information is needed for computing control dependence information)

6. What variables can be modified or referenced by a procedure

The actual paths taken during execution are difficult to determine, so it is difficult to give precise answers to many of these questions; instead, approximate solutions are computed. These solutions are conservative in the sense that they err on the safe side: they overestimate the number of definitions reaching a program point or the objects a reference variable can point to.

Initially, data flow analysis was performed by executing the program symbolically, tracing through all control flow paths and collecting information about dataflow values. This is computationally expensive, and there can be termination problems if dataflow values keep alternating. Monotone data flow frameworks overcome these problems as follows.


1. A partial order is placed on the abstract values such that they change in the same direction during abstract interpretation, avoiding termination problems.

2. If two control flow paths merge, we have to assume conservative information that holds for the abstract values along both paths. A semilattice L is used to represent the abstract values because the meet operation gives us exactly this information.

3. To every node v in the CFG, we assign a variable [[v]] ranging over the elements of L.

4. For every point in the program, an equation relates the value of the variable of the corresponding node to those of other nodes (typically the neighbors); a typical equation shape is shown after this list.
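As a sketch of the typical shape of these equations (the exact form depends on the analysis; this generic forward form is an illustration, not from the original text), for a node v with predecessors w1, ..., wk:

    [[v]] = fv([[w1]] ∧ ... ∧ [[wk]])

where fv is the transfer function of node v.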

To guarantee termination, two useful limitations are placed on the general framework: all transfer functions must be monotone and the height of the lattice must be finite. Intuitively, this means that the transfer functions can only climb up, and since the height is finite, they can at most reach the top and thus have to stabilize. This framework is called the monotone data flow analysis framework; it is described below.

With each node a function is associated which maps each value in the semilattice to another value in the semilattice. The set of such functions over all nodes in the program forms a system of dataflow equations. A fixed point solution to these equations gives the conservative information that can be assumed to be valid at every node in the program.

For dataflow frameworks the set of functions F has to satisfy the following properties.

1. Closedness under composition is important because we can summarize the effects of a sequence of statements without leaving the function space.

2. The identity function has to be in F to account for empty basic blocks.

3. F must be closed under pointwise meet: if h(x) = f(x) ∧ g(x) for f, g ∈ F, then h ∈ F. This is necessary because h can represent the effect of the convergence of two control flow paths.


4. Monotonicity of the functions in F guarantees the existence of a fixed point solution.

Distributivity is a stronger property than monotonicity; as discussed in Section 2.3.2, for distributive frameworks the fixed point solution coincides with the meet over all paths solution.
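To make these properties concrete, here is a minimal sketch (Python, with illustrative names; nothing here is from the original text) of the classical gen/kill function space, in which all four properties can be checked directly:

    class GenKill:
        """Transfer function f(X) = GEN ∪ (X − KILL) over sets.
        All such functions are monotone: x ⊆ y implies f(x) ⊆ f(y)."""
        def __init__(self, gen=frozenset(), kill=frozenset()):
            self.gen, self.kill = frozenset(gen), frozenset(kill)

        def __call__(self, x):
            return self.gen | (x - self.kill)

        def compose(self, other):
            # (self ∘ other)(x) = self(other(x)); the result is again
            # a gen/kill function, so F is closed under composition.
            return GenKill(self.gen | (other.gen - self.kill),
                           self.kill | other.kill)

        def meet(self, other):
            # Pointwise meet, with union as the meet operation (as in
            # reaching definitions); again a gen/kill function.
            return GenKill(self.gen | other.gen, self.kill & other.kill)

    IDENTITY = GenKill()   # f(X) = X, for empty basic blocks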

2.1 Characterization of data flow frameworks

In the presence of loops, the transfer function f has to be applied multiple times to summarize the effect of the loop. If f describes the dataflow effect of one pass around a cyclic path, then f∗ represents the effect of a loop whose number of iterations is a priori indeterminate. f∗ is said to be k-bounded if a fixed point is reached before the k-th pass around the loop. For the classical bitvector problems, the transfer function is given by f(X) = GEN ∪ (X − KILL); thus f ◦ f = f, making them 2-bounded. Frameworks for which f ◦ f = f holds are called fast frameworks.
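As a quick check of this claim (a standard calculation, not spelled out in the original text):

    f(f(X)) = GEN ∪ ((GEN ∪ (X − KILL)) − KILL)
            = GEN ∪ (GEN − KILL) ∪ ((X − KILL) − KILL)
            = GEN ∪ (X − KILL)
            = f(X)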

In summary, a dataflow framework is characterized by

1. Algebraic properties of the functions in F (monotonicity, distributivity)

2. Finiteness properties of the functions in F (boundedness, fastness, rapidity)

3. Finiteness properties of the lattice L (height)

4. Partitionability of L and F

There is an important class of k-bounded partitionable problems called bitvector problems.

2.2 Dataflow Analysis Examples

Dataflow analysis techniques are used to discover compile-time code improvements.

Reaching Definitions

The reaching definitions problem asks, for each node (program point) n, which assignments might have determined the values of the program variables at n. It uses (℘(Asgn), ⊇, ∪, Asgn, ∅) as the property lattice.
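For instance (an illustrative case, not from the original text): if definition d1: x = 1 is followed by a conditional branch containing d2: x = 2, and then by a use of x, both d1 and d2 reach the use: d1 along the path that skips the branch, d2 along the path through it. The ⊇ ordering with ∪ as meet accumulates definitions from all paths in exactly this way.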


2.3 Solution Procedures

The solution to a dataflow problem can be characterized in two ways: the meet over all paths (MOP) solution and the maximal fixed point (MFP) solution.

2.3.1 Meet over all paths solution

A transfer function is associated with every basic block. We can define the transfer function associated with a path P given by x0 → x1 → x2 → ... → xn as the composition of the transfer functions associated with the individual basic blocks along P.

The meet over all paths solution at a node n is given by

    MOP(n) = ∧ { fP(init) | P is a path from the start node to n }

If all paths are executable, then the MOP solution is the best statically determinable solution. However, it is infeasible to determine which paths are actually executed, and the inaccuracy results from the fact that additional infeasible paths may be included. Thus the MOP solution is a conservative approximation to the real solution.

2.3.2 Maximal Fixed Point solution

The MFP solution is the maximal fixed point of the system of dataflow equations, computed by iterating the equations from a conservative initial value until they stabilize.

Though MOP is the best possible solution, it has been proved that a general algorithm to compute the MOP solution does not exist [?]. Intuitively, this is due to the presence of loops, which leads to an infinite number of paths. The MFP solution sacrifices precision for computability; it is less precise because it considers only the effect on flow values of adjacent neighbors. For distributive frameworks the two solutions are identical.

2.3.3 Iterative Methods

The iterative method solves the system of equations by initializing the node variables to some conservative values and successively recomputing them till a fixed point is reached. The naive implementation of this is called chaotic iteration. Since this is not efficient, cleverer iterative algorithms such as worklist, round-robin, and node-listing algorithms are used.
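As a concrete sketch of the worklist idea (a minimal Python illustration with assumed names and encoding, not taken from the text), the following solves reaching definitions on the small example from Section 2.2:

    from collections import deque

    def solve_reaching_definitions(cfg, gen, kill):
        """cfg maps each node to its successors; gen/kill map each node
        to a set of definitions. Returns the IN set of every node."""
        preds = {n: set() for n in cfg}
        for n, succs in cfg.items():
            for s in succs:
                preds[s].add(n)
        in_sets = {n: set() for n in cfg}
        out_sets = {n: set() for n in cfg}
        worklist = deque(cfg)                 # visit every node at least once
        while worklist:
            n = worklist.popleft()
            in_sets[n] = set().union(*(out_sets[p] for p in preds[n]))
            new_out = gen[n] | (in_sets[n] - kill[n])
            if new_out != out_sets[n]:        # propagate only real changes
                out_sets[n] = new_out
                worklist.extend(cfg[n])
        return in_sets

    # illustrative CFG: d1: x=1; if (...) d2: x=2; d3: y=x
    cfg  = {"d1": ["d2", "d3"], "d2": ["d3"], "d3": []}
    gen  = {"d1": {"d1"}, "d2": {"d2"}, "d3": set()}
    kill = {"d1": {"d2"}, "d2": {"d1"}, "d3": set()}
    print(solve_reaching_definitions(cfg, gen, kill))
    # {'d1': set(), 'd2': {'d1'}, 'd3': {'d1', 'd2'}}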

Kam and Ullman showed that there is a class of problems that a round-robin algorithm visiting nodes in reverse postorder can solve in d(G) + 3 passes over the graph G, where d(G), the depth of the graph, is the maximum number of back edges that can occur on any acyclic path in G.

2.3.4 Elimination Methods

When the CFG is reducible, which means that there are no multiple-entry loops, elimination algorithms are preferred because they are usually more efficient than iterative algorithms. The central idea of the elimination method is to represent all paths from the start node to a particular node as a regular expression. We can obtain a high-level summary function fre corresponding to the regular expression by replacing each node with its transfer function, concatenation with function composition, the * operator with function closure, and union with the meet operation. Thus the data value at a node is fre(init). The requirement of reducibility of the CFG ensures that the summary function can be obtained from the above operations. Since a regular expression is used to represent a set of paths, the elimination method is also called the path algebra based approach.

Elimination methods exploit the structural properties of the graph. The flow graph is reduced to a single node using a series of graph transformations. The data flow properties of a node in a region are determined from the dataflow properties of the region's header node.
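A minimal sketch (Python, illustrative only, not from the original text) of the mapping from regular expression operators to function space operators, with transfer functions represented as plain callables over sets:

    def compose(f, g):
        # concatenation of paths -> function composition
        return lambda x: f(g(x))

    def meet(f, g, join=set.union):
        # union of path sets -> pointwise meet (here: set union)
        return lambda x: join(f(x), g(x))

    def star(f, join=set.union, max_iter=64):
        # the * operator -> function closure: f*(x) combines f^i(x), i >= 0
        def closure(x):
            acc, cur = set(x), set(x)
            for _ in range(max_iter):
                cur = f(cur)
                new = join(acc, cur)
                if new == acc:
                    break
                acc = new
            return acc
        return closure

    # loop body with GEN = {d2}, KILL = {d1}
    body = lambda x: {"d2"} | (x - {"d1"})
    print(star(body)({"d1"}))   # {'d1', 'd2'}: stable after one pass (2-bounded)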

3 Other Methods of Program Analysis

1. Abstract Interpretation

2. Constraint based analysis

Many data flow analysis problems like reaching definitions can be efficiently solved using abstract interpretation. For other problems, like pointer analysis and type inference, a constraint based approach is more suitable.

Using data flow analysis, many interesting pieces of information cannot be gathered, because data flow analysis does not make use of the semantics of the programming language's operators.

Abstract Interpretation

Abstract interpretation amounts to executing the program on abstract values instead of actual values. For example, the abstract values for integers can be their signs or the property of being even/odd. Since the concrete computation is not performed, the results inferred by abstract interpretation can only be approximate. Another reason for approximation is that abstract interpretation computes conservative information that is valid on all control flow paths.

Abstract interpretation is a theory of semantics approximation. The idea is to create a new semantics for the programming language such that the new semantics always terminates and the store at every program point contains a superset of the values that are possible in the actual semantics, for every possible input. Since in this new semantics a store does not contain a single value for a variable but a set (or interval) of possible values, the evaluation of boolean and arithmetic expressions must be redefined.

Abstract interpretation is a very powerful program analysis method. It uses information about the programming language's semantics and can detect possible runtime errors, like division by zero or variable overflow. Since abstract interpretation can be computationally very expensive, care must be taken to choose an appropriate value domain and an appropriate heuristic for loop termination to ensure feasibility.

For example, in the following program, determining the actual values x can take is not feasible. However, if we are interested in the abstract values of x, say odd and even, we can determine that x can have only even values. Abstract interpretation determines this information.

void f() {
    int x = 2;
    while (...) {
        // what values can x have here?
        x = x + 2;
    }
}
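A minimal sketch of this analysis over the parity domain (Python; the domain encoding and names are illustrative assumptions, not from the text):

    BOT, EVEN, ODD, TOP = "bot", "even", "odd", "top"

    def join(a, b):
        # least upper bound in the flat parity lattice
        if a == BOT: return b
        if b == BOT: return a
        return a if a == b else TOP

    def abs_add(a, b):
        # abstract addition: even+even=even, even+odd=odd, odd+odd=even
        if BOT in (a, b): return BOT
        if TOP in (a, b): return TOP
        return EVEN if a == b else ODD

    # Abstractly execute:  x = 2; while (...) { x = x + 2; }
    x_at_head = EVEN                              # x = 2 before the loop
    while True:
        x_after_body = abs_add(x_at_head, EVEN)   # x = x + 2
        new = join(x_at_head, x_after_body)       # merge entry and back edge
        if new == x_at_head:
            break                                 # fixed point reached
        x_at_head = new
    print(x_at_head)   # even: x can have only even values in the loop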


The most common application of abstract interpretation is the solving of dataflow problems.

Set based analysis

In set based analysis, sets of values are associated with a variable. The major difference between set based analysis and abstract interpretation is that set based analysis does not employ an abstract domain to approximate the underlying domain of computation. In set based analysis, the approximation occurs because all dependencies between variables are ignored. For example, if at some point in the program the environments [x → 1, y → 2] and [x → 3, y → 4] can be encountered, then set based analysis will conclude that x can have values 1, 3 and y can have values 2, 4. The dependency "x is 1 when y is 2" is lost.

In constraint based analysis, the properties to be computed are expressed as set constraints. In the following example, the property to be computed is the set of values a variable can have at runtime.

if (...) {
    x = 3;
}
else {
    x = 6;
}
// x can have {3,6}
print(x);
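Written as set constraints (using an illustrative set variable Xx for the values of x), the two branches contribute Xx ⊇ {3} and Xx ⊇ {6}, whose minimum solution is Xx = {3, 6}, matching the comment above.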

These sets need not represent only integers, as in the above example; they can represent the points-to sets of a variable or the types of a receiver variable.

This "set based" approach to program analysis consists of three phases. The first step is to label the program points of interest, which may be terms, expressions, or program variables. The second step is to associate with each label a variable which denotes the abstract values at that point; one can then derive a set of constraints on these variables. In the final step, these constraints are solved to find their minimum solution (this solving process is the main computational part of the analysis).

To solve the constraints, the usual procedure is to represent each set expression as a node and each constraint as a directed edge. A transitive closure of this graph gives all possible constraints that can be inferred.
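A minimal sketch of this graph based solving (Python, with an illustrative encoding, not from the text: base facts are constraints {c} ⊆ X, edges are constraints X ⊆ Y):

    def solve(base, edges):
        """base: var -> set of constants known to be in it;
        edges: list of (x, y) meaning 'everything in x flows into y'.
        Propagates along edges until closure: the minimum solution."""
        sol = {v: set(s) for v, s in base.items()}
        changed = True
        while changed:
            changed = False
            for x, y in edges:
                extra = sol.get(x, set()) - sol.setdefault(y, set())
                if extra:
                    sol[y] |= extra
                    changed = True
        return sol

    # X ⊇ {3}, X ⊇ {6} from the two branches; Y ⊇ X at print(x)
    print(solve({"X": {3, 6}}, [("X", "Y")]))
    # {'X': {3, 6}, 'Y': {3, 6}}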

Data flow analysis can be viewed as set based analysis. For example, consider live variable analysis for C programs.

1. Domain: set of program variables

2. Variables: Sin, Sout

3. Constants: Sdef, Suse

4. Constraints: Sin = Suse ∪ (Sout − Sdef)

5. Sout = ⋃ { Xin | X ∈ succ(S) }
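For instance (an illustrative statement, not from the text), for S: x = y + z with single successor T, Sdef = {x} and Suse = {y, z}, so Sin = {y, z} ∪ (Tin − {x}): y and z are live before S, and x is live before S only if it is live after S for some other reason.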

3.1 Interprocedural Analysis - Context Sensitivity

3.1.1 Functional Approach

The idea behind the functional approach is the same as that of elimination algorithms, where we used a large transfer function to summarize the effect of a region. The set of valid paths in the intraprocedural case can be represented by a regular expression; in the interprocedural case, calls and returns have to be matched, so the set of valid paths is given by a context free grammar. A nonterminal in the context free grammar represents a procedure. It is possible to compute a large function corresponding to the nonterminal which summarizes the effects of the paths produced by that nonterminal.

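As an illustrative sketch (the original text leaves the example unstated): if a procedure p has the body f1; f2, its summary function is φp = f2 ◦ f1, obtained exactly as in the intraprocedural elimination case. At every call site of p, the dataflow value after the call is then φp applied to the value before the call, so calls and returns are matched by construction rather than treated as ordinary control flow edges.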

3.1.2 Call String Approach

The idea behind the call string approach is that the contents of the call stack are an important element for distinguishing dataflow information; the context is usually the current call stack. The use of call strings guarantees that dataflow values from invalid paths are not considered. The length of a call string may be unbounded, so k-bounded call strings are used.
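For example (illustrative, not from the text): if procedure p is called from call sites c1 and c2, the call strings ⟨c1⟩ and ⟨c2⟩ keep the dataflow values of the two contexts separate, so information entering p from c1 cannot flow back out to the return point of c2. With k-bounding, call strings longer than k are truncated to their most recent k call sites, trading some context sensitivity for termination.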
