
Bachelor’s Thesis

SSA-based Optimizations for Just-in-Time Compilation in the CACAO VM

Vienna University of Technology

Institute of Computer Languages
Compilers and Languages Group

February 8, 2014

Author: Matthias Reisinger

Supervisor: Ao.Univ.Prof. Dipl.-Ing. Dr.techn. Andreas Krall


Abstract

The CACAO VM is a virtual machine for the native execution of Java bytecode based on just-in-time compilation. Since bytecode interpretation is not applied, a fast compiler performs the initial compilation tasks to translate a Java program to machine code. Due to the need for short compilation times, exhaustive optimizations are not employed at this level. Subsequently, those parts of the program which are executed frequently have to be recompiled using a higher level of optimization. At the time of writing, a new version of the compiler framework for performing these recompilation tasks is being implemented for future integration into the CACAO VM. Prior to optimization, this framework translates the program into a high-level intermediate representation based on static single assignment form. The subject of this work is to examine machine-independent optimization techniques and how to apply them to this intermediate representation. These considerations serve as the basis for implementation and incorporation into the new compiler. Concretely, the techniques dead code elimination, constant folding, constant propagation and global value numbering have been examined and realized accordingly. An empirical analysis of these optimizations discloses the effects of incorporating them into the new compiler.


Contents

1. Introduction
   1.1. The CACAO VM
   1.2. Motivation
   1.3. Aim of the Work
   1.4. Structure of the Thesis

2. The Intermediate Representation

3. Dead Code Elimination
   3.1. Application to the Intermediate Representation

4. Constant Folding & Propagation
   4.1. Application to the Intermediate Representation

5. Global Value Numbering
   5.1. Application to the Intermediate Representation

6. Evaluation
   6.1. Methodology
   6.2. Results

7. Related Work

8. Conclusions

A. Alpern et al.'s Partitioning Algorithm


1. Introduction

1.1. The CACAO VM

The CACAO VM is a virtual machine for executing Java bytecode. Instead of interpreting this bytecode, just-in-time compilation is used to execute Java programs natively. Consequently, the first time a part of a program, i.e. a method, has to be executed by the virtual machine, the so-called baseline compiler directly translates it to native machine code, which is invoked subsequently. Short compilation time is required at this level, so exhaustive optimization tasks cannot be taken into account. As a consequence, for those program parts which are executed very often, the machine code produced at this level will in general not provide adequate efficiency. The run-time behavior of the program is therefore profiled to identify those methods which are executed more frequently. The according program parts are then recompiled with in-depth optimizations to produce efficient native code eligible for frequent execution. This approach is referred to as adaptive optimization. At present, these recompilation tasks are also performed by the baseline compiler, which at this level applies additional optimization techniques to meet the according efficiency requirements.

1.2. Motivation

For reasons of maintainability and the need for greater flexibility concerning the integration of new optimization techniques, the baseline compiler will be replaced as optimizing compiler. Therefore a new compiler will be incorporated into the CACAO VM, which performs the according tasks in the course of adaptive optimization. Currently this compiler is under implementation and not yet fully operational; in particular, in-depth optimizations still have to be integrated.

1.3. Aim of the Work

The subject of this thesis is to examine machine-independent optimizations and how to apply them to the SSA-based intermediate representation used within the new compiler framework. These examinations form the basis for their implementation and integration into the compiler. Because the compilation tasks are performed during run-time of the virtual machine, efficient algorithms are required so that the optimizations can be executed quickly.

1.4. Structure of the Thesis

Section 2 gives a basic introduction to the intermediate representation used in the new compiler framework. In sections 3, 4 and 5 we describe the optimization techniques we examined for use within the compiler. For each optimization we describe its goals and general characteristics. The last part of each of these sections consists of detailed explanations of how the optimization can be applied to the intermediate representation, together with algorithms formally describing how the technique can be realized. The effects of incorporating them into the compiler are discussed in section 6, where the results of the empirical evaluation are presented.


In section 7 we give an overview of alternative approaches and describe how they differ from the presented techniques. The last part of the thesis is section 8, which outlines the central findings of our work.

2. The Intermediate Representation

The new optimizing compiler of the CACAO VM uses two different intermediate representations for the code [Eis13]. A high-level representation model, based on static single assignment form, forms the core of the architecture-independent parts. Later on, within the architecture-dependent compiler back-end, this structure is transformed into a low-level representation, which uses lists of machine instructions to form basic blocks. The optimization algorithms presented in subsequent sections are targeted at the machine-independent front-end of the compiler and operate on the high-level representation. In the following, we give an overview of its main characteristics because basic knowledge about it is necessary for a thorough understanding of the optimization algorithms. We will not go into detail on every aspect; for further information refer to [Eis13].

As already mentioned, this intermediate representation is based on static single assignment form. It represents a method's code in the form of a graph-based structure, where instructions are represented by nodes. Each node has an associated type, which is part of a static type hierarchy. Each type is a direct or indirect sub-type of Instruction, which forms the base of this hierarchy. In the following, we will use the terms node and Instruction interchangeably (similarly, to refer to nodes associated with a sub-type of Instruction, we will use the corresponding type name). The possible relationships between nodes are represented by three different types of edges: control-flow edges, data-flow edges and scheduling edges.

The units of control-flow within this graph are the nodes of type BEGINInst and ENDInst. These nodes form pairs which are comparable to basic blocks in traditional representation models. But as we will see later on, in contrast to basic blocks, there do not necessarily exist lists of instructions assigned to these control-flow units. On the one hand, several control-flow paths merge at BEGINInsts, meaning that a number of control-flow edges can go into a node of this type. Each BEGINInst has a single outgoing edge leading to exactly one ENDInst. On the other hand, these ENDInsts can have several outgoing control-flow edges to BEGINInsts. There are different sub-types of ENDInst: GOTOInsts, for example, which can only have a single outgoing edge, or IFInsts with two outgoing edges; another important example are nodes of type RETURNInst, which have no outgoing edges at all.

static long fact(long n) {
    long res = 1;
    while (1 < n) {
        res *= n--;
    }
    return res;
}

Listing 2.1 Example adopted from [Eis13]


Besides these units expressing the program's control-flow, there are also primitive nodes that compute and supply data values. They can have multiple inputs, namely their operands, which are represented by ordered incoming data-flow edges. The value computed by such a node can be consumed and serve as operand for other nodes. The latter case is modeled by data-flow edges starting at the supplying node and ending at the consuming one. A special case of such nodes are those which represent φ-functions. These PHIInsts serve to merge values at points where multiple control-flow paths join. Each φ-node is fixed to exactly one BEGINInst and accordingly has to be scheduled after this node. Such dependencies are modeled by scheduling edges, which define the order of execution between nodes. Although there exist some explicit scheduling dependencies as in the case of φ-nodes, an important aspect of this intermediate representation is that there is no complete instruction schedule.
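To make this structure concrete, the following C++ sketch models the essential fields of such a graph node. It is a minimal illustration under assumed names; the actual framework uses a static type hierarchy of Instruction sub-classes rather than virtual predicates like these. Later sketches in this thesis reuse this simplified structure.

    #include <vector>

    // Minimal sketch of an IR node (names are illustrative, not the real API).
    struct Instruction {
        // Ordered incoming data-flow edges: the operands (use-def direction).
        std::vector<Instruction*> operands;
        // Outgoing data-flow edges: nodes consuming this value (def-use direction).
        std::vector<Instruction*> users;
        // Nodes carrying an explicit scheduling dependency on this one,
        // e.g. a PHIInst is fixed to its BEGINInst.
        std::vector<Instruction*> schedulingUsers;

        virtual bool isControlFlow() const { return false; }  // BEGINInst/ENDInst
        virtual bool hasSideEffects() const { return false; } // e.g. stores, calls
        virtual ~Instruction() = default;
    };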

The method in listing 2.1 would be translated into this intermediate representation as can be seen in figure 2.1.

[Figure 2.1: The intermediate representation graph of the method in listing 2.1, consisting of BEGINInst/ENDInst pairs (GOTOInsts, an IFInst, a RETURNInst), two PHIInsts, a MULInst, a SUBInst, a LOADInst and CONSTInsts. Control-flow is marked by thick solid arrows. To highlight pairs of BEGINInsts and ENDInsts, the according edges are dashed. Data-flow is represented by thin arrows pointing in def-use direction. Scheduling dependencies are marked blue. Example adopted from [Eis13]]


3. Dead Code Elimination

The term dead code elimination has originally been used to refer to two separate optimization techniques, namely unused code elimination and unreachable code elimination [WZ91]. Despite its title, this section is targeted only at the first of these two techniques. The examinations and algorithms for unused code elimination described here are mainly based on the considerations in [AP03], where the notion of dead code elimination is used as a synonym for this technique. We will therefore also use the term in this sense to refer to this optimization.

The goal of this technique, intuitively, is to remove from the program all those code sections whose results are never used. Assuming that the program subject to dead code elimination satisfies the property of static single assignment form, dead variables can be characterized as follows:

Definition 3.1. A variable is dead if and only if its list of uses is empty and its assignmentstatement has no side effects.
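For example, in the following C++ fragment (a contrived illustration), the variable t satisfies definition 3.1: it has no uses, and the defining multiplication has no side effects, so the assignment can be removed.

    int f(int a, int b) {
        int t = a * b;  // t is never used and a * b has no side effects: t is dead
        return a + b;   // only a and b are live here
    }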

It is further assumed that each statement in the program, or in the according intermediate representation, is an ordinary assignment, an assignment using a φ-function, a fetch, a store or a branch [AP03]. Based on these considerations and on definition 3.1, the idea of this optimization can be characterized by the following short algorithm:

Algorithm 3.1 Basic idea of dead code elimination

1: while there is a dead variable v do
2:     remove v's defining statement from the program
3: end while

At each iteration, the algorithm removes one statement which defines a dead variable. Furthermore, the affected statement has to be removed from the use lists of those variables which are referenced within such a defining assignment. This in turn can cause these variables to become dead, which will be handled in subsequent iterations.

3.1. Application to the Intermediate Representation

As mentioned earlier, the intermediate representation of the new compiler satisfies the property of static single assignment form. Nevertheless, it does not contain statements or variables in the traditional sense; instead it uses nodes to express data-flow and control-flow within the program. Therefore, it is not possible to directly apply the definition of dead variables to optimize this graph-based structure. But based on the previous considerations, definition 3.2 gives a characterization of dead nodes (it explicitly excludes control-flow nodes, as they do not represent data computations and thus do not supply values for consumption by other nodes).

Definition 3.2. A node, which does not mark control-flow, is dead if and only if its list of uses is empty, it has no side effects and there are no other dependencies on that node.

Based on the notion of dead nodes, it is now possible to translate the idea of dead code elimination accordingly, as shown in algorithm 3.2. This pseudo code leaves many details open and is not meant as a reference for implementation. Therefore, algorithm 3.3 gives a more formal formulation of dead code elimination which is based on [AP03].


Algorithm 3.2 Basic idea of dead code elimination on the intermediate representation

1: while there is a dead node n do
2:     remove node n from the graph
3: end while

It iteratively removes dead nodes from the graph. For this purpose it uses a work list W, which before every iteration of the while-loop contains all those nodes that have to be reconsidered by the algorithm. Although this container is called a "work list", it is not important whether the underlying data structure is organized as a list or as another container type. The only aspects to consider are that the removal in line 3 is done in constant time and that the insertion at line 7 (represented by the union operation) avoids adding duplicates, assuming a constant time membership check on the container (in fact it would also be possible to handle duplicates when removing nodes from the work list). At every iteration of the while-loop, a node that has to be reconsidered is removed from the work list, to examine whether it is dead and thus should be deleted from the program. Every time a node is identified as dead, it is also deleted from the list of uses of all the operand nodes referenced by the dead node. In turn, each of these nodes has to be reconsidered, because its own list of uses may have become empty, and hence is added to the work list.

Algorithm 3.3 Dead code elimination

1: W ← all nodes in the intermediate representation graph
2: while W is not empty do
3:     remove some node n from W
4:     if n is dead then
5:         for each node xi used by n do
6:             delete n from the list of uses of xi
7:             W ← W ∪ {xi}
8:         end for
9:     end if
10: end while

The general running time of this algorithm is linear in the size of the intermediate representation graph. Each time a node is removed from the work list and discovered to be dead, each of its operands is visited (the number of operands of any node is equal to the number of use-def edges starting at that node). In case the node is not dead, the amount of work done is constant. An aspect that could increase the asymptotic run-time complexity of this algorithm is the deletion of nodes from the use lists of their operands in line 6. Therefore, a data structure that allows constant time removal should be used for the realization of the use lists.
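The following C++ sketch renders algorithm 3.3 on the simplified Instruction structure from section 2. The dead-node test follows definition 3.2; all names are illustrative and do not reflect the compiler's actual code.

    #include <deque>
    #include <unordered_set>
    #include <vector>

    // Definition 3.2: not a control-flow node, no uses, no side effects,
    // and no other dependencies on the node.
    static bool isDead(const Instruction* n) {
        return !n->isControlFlow() && !n->hasSideEffects()
            && n->users.empty() && n->schedulingUsers.empty();
    }

    // Algorithm 3.3: work-list based dead code elimination.
    void eliminateDeadCode(const std::vector<Instruction*>& graph) {
        std::deque<Instruction*> worklist(graph.begin(), graph.end());
        std::unordered_set<Instruction*> onList(graph.begin(), graph.end());

        while (!worklist.empty()) {
            Instruction* n = worklist.front();   // line 3: remove some node n
            worklist.pop_front();
            onList.erase(n);
            if (!isDead(n)) continue;            // line 4

            for (Instruction* op : n->operands) {    // lines 5-8
                std::erase(op->users, n);  // line 6 (C++20; the real use lists
                                           // should support O(1) removal)
                if (onList.insert(op).second)        // avoid duplicates
                    worklist.push_back(op);
            }
            n->operands.clear();  // n is now detached from the graph
        }
    }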

4. Constant Folding & Propagation

In general, constant folding and constant propagation are techniques that optimize programs by performing transformations on those code parts that make use of values which are statically known to be constant. The reason to cover these optimizations in a single section, rather than devoting one to each of them, is that they can be combined very effectively.

Constant folding, also known as constant expression evaluation, evaluates at compile-time those expressions for which each operand is a constant value [Muc97]. That means, computations that would have been performed at run-time are now executed during the compilation process. It is important that the result of this compile-time computation be the same as the one that would have been computed during the execution of the compiled program. Thus it is necessary to consider all those aspects that could cause a difference between the result of a static evaluation and that of the run-time evaluation of an expression. For example, if the floating point representation or the floating point arithmetic of the compiler deviates from the run-time environment of the compiled program, this could lead to differing results of the according computations. Another aspect to consider is that exceptional cases, like divisions by zero, have to be handled accordingly.
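As a small illustration of such exceptional cases, consider folding a constant Java long division in a C++ host, as in the following hedged sketch (foldDiv is a hypothetical helper, not the framework's folding routine). Division by zero must not be folded, because it throws at run-time; and although Long.MIN_VALUE / -1 is defined in Java to wrap around, evaluating it with the host's / operator would be undefined behavior in C++, so it is handled explicitly.

    #include <cstdint>
    #include <optional>

    std::optional<int64_t> foldDiv(int64_t lhs, int64_t rhs) {
        if (rhs == 0)
            return std::nullopt;   // keep the run-time ArithmeticException
        if (lhs == INT64_MIN && rhs == -1)
            return INT64_MIN;      // Java wrap-around; host '/' would be UB
        return lhs / rhs;          // safe to evaluate at compile-time
    }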

Assuming a program or intermediate representation is in static single assignment form, the general idea of constant propagation can be described by the following rule [AP03]: for any assignment of the form v ← c for some constant c, replace each use of the variable v by c. Nevertheless, this rule ignores an important case, namely the occurrence of φ-functions, for which an additional rule has to be considered [AP03]: any φ-function of the form v ← φ(x1, x2, . . . , xn), where all xi are equal to some constant c, can be replaced by v ← c. An iterative application of these rules can then be used to propagate constants as far as possible through a program. The principles of constant propagation considered so far describe a restricted form of this type of optimization. In section 7 we will therefore shortly present an extension of this optimization technique which combines constant propagation with unreachable code elimination and thereby discovers additional optimization possibilities.

Note that constant folding, on the one hand, can lead to code that can be further optimized by constant propagation. The application of constant propagation, on the other hand, can lead to code that in turn can be further optimized by constant folding. For example, the expression on the right-hand side of the assignment statement in listing 4.1 can be evaluated, leading to the code in listing 4.2. This provides the further possibility to propagate the constant value of variable a to the return statement in the next line. This in turn yields further folding possibilities, as can be seen in listing 4.3, where the operation within the return statement only uses constant operands.

a = 1 + 2;
return a + 3;

Listing 4.1

a = 3;
return a + 3;

Listing 4.2

a = 3;
return 3 + 3;

Listing 4.3

Obviously, the rules for constant propagation defined earlier lead to dead code, as shown in listing 4.3, where variable a is not referenced anymore. A run of dead code elimination, as described in section 3, can be used to eliminate the unused code.


4.1. Application to the Intermediate Representation

Due to the characteristics of the intermediate representation of the compiler framework, constant propagation is an implicit effect of constant folding. Each node with constant operands that represents an operation which can be evaluated statically (like arithmetical or logical operations) is replaced by introducing a new node of type CONSTInst. The value of this constant node is the result of the original node's operation applied to the values of its operands. Furthermore, each node that references the replaced node as an operand has to adjust the according references within its operand list to point to the newly introduced node. The replacement of nodes in the course of folding deletes all references to them, so they become dead according to definition 3.2 in section 3.1. A subsequent run of dead code elimination at the end of this optimization deletes these dead nodes.

To illustrate the application of constant folding, we will reconsider the example code from listing 4.1, but we first have to translate it into the intermediate representation (for reasons of simplicity, the following figures will not include the dead nodes that would reside within the program). Figure 4.1 shows the graph which is the result of this translation.

[Figure 4.1: The IR graph for listing 4.1. A RETURNInst consumes an ADDInst whose operands are an inner ADDInst (over CONSTInst = 1 and CONSTInst = 2) and a CONSTInst = 3.]

The topmost node of type ADDInst in this figure represents the expression 1 + 2, having only constant operands. Thus, it can be evaluated and replaced by a new node of type CONSTInst whose value corresponds to the result of this operation. Figure 4.2 presents the graph after the according transformation. The remaining ADDInst now has only operands of type CONSTInst. Thus, it can be folded further and replaced by a new constant node, leading to the graph in figure 4.3.

[Figure 4.2: The graph after the first folding step. The RETURNInst consumes an ADDInst whose operands are now CONSTInst = 3 and CONSTInst = 3.]

[Figure 4.3: The graph after the second folding step. The RETURNInst directly consumes a CONSTInst = 6.]


Algorithm 4.1 Combined constant folding & propagation

1: W ← all nodes in the intermediate representation graph
2: while W is not empty do
3:     remove some node n from W
4:     if all operands oi of node n are constant then
5:         if n is of type PHIInst and all oi are equal to some constant value c then
6:             ReplaceByConstant(n, c)
7:         else if n's operation can be evaluated statically then
8:             x ← the result of n's operation applied to n's operands
9:             ReplaceByConstant(n, x)
10:        end if
11:    end if
12: end while

These examples illustrate how the propagation of constants is done implicitly when introducing the new constant nodes during folding. The only places where propagation has to be done explicitly are occurrences of nodes which represent φ-functions. The following rule covers this case: each node of type PHIInst, for which all operands are equal to some constant c, can be replaced by introducing a new node of type CONSTInst whose value is set to c.

Based on the previous considerations, a more detailed formulation of this combined version of constant folding and constant propagation is given in algorithm 4.1. It is an adaptation of a constant propagation algorithm described by Appel and Palsberg [AP03] and revisits the idea of using a work list for those nodes that should be reconsidered, as was already done for dead code elimination in section 3.1. We again assume constant time removal and the avoidance of duplicate entries in this container. At each iteration of the while loop a node is removed from the work list, to check whether all of its operands are constant nodes. According to the previously defined rules, we then have to further distinguish whether the node is of type PHIInst or whether it represents some other operation that can be evaluated at compile-time. The steps performed during the according node replacements have been extracted into a procedure whose pseudo code is shown in algorithm 4.2.

In general, the condition test in line 4 of algorithm 4.1 would involve examining the node's whole operand list. Nevertheless, it is possible to implement this test so that it executes in constant time. One possibility is to keep a counter for each node, whose value represents the number of operands which are constants. Each time a CONSTInst is added to the node's operand list, the counter is increased. Accordingly, each time a CONSTInst is removed from the operand list, the counter is decreased. Thus if all operands of a node are constant, this counter is equal to the number of operands. The condition test can then be realized by comparing the counter to the size of the operand list.
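A hedged C++ sketch of this counter idea, building on the Instruction structure from section 2 (isConst is an assumed predicate for CONSTInst nodes):

    #include <cstddef>

    bool isConst(const Instruction* n);  // assumed: true iff n is a CONSTInst

    struct CountingNode : Instruction {
        std::size_t constOperands = 0;  // number of operands that are CONSTInsts

        // All operand list mutations must keep the counter in sync.
        void setOperand(std::size_t i, Instruction* newOp) {
            if (isConst(operands[i])) --constOperands;
            if (isConst(newOp))       ++constOperands;
            operands[i] = newOp;
        }

        // The condition test of line 4 in algorithm 4.1, now in O(1).
        bool allOperandsConstant() const {
            return constOperands == operands.size();
        }
    };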

The running time of the presented algorithm is in O(E · O + N), where E is the number of data-flow edges in the graph, O is the maximum number of operands of any node and N is the number of nodes in the program. We establish this in the following.

Algorithm 4.2 Procedure for node replacements

1: procedure ReplaceByConstant(n_old, c)
2:     create a new constant node n_new
3:     n_new.value ← c
4:     for each user ui of n_old do
5:         replace each occurrence of n_old in ui's list of operands by n_new
6:         W ← W ∪ {ui}
7:     end for
8: end procedure

Each time a node is replaced by a new constant node (and obviously each node can be replaced at most once), we have to examine all of its users, which is done by following each def-use edge that starts at that node. Consequently, the maximum number of data-flow edges visited in def-use direction during the whole algorithm is limited by E. For each of the users, it is necessary to inspect the whole operand list, to replace all references to the original node by a reference to the new constant node. This takes time linear in the size of the corresponding operand list, which is limited by O. Each user is then placed on the work list for reconsideration. If one of these nodes is later removed from the work list, the condition test in line 4 of algorithm 4.1 is performed, which, with respect to the previous considerations, can be done in constant time (this test is actually performed at least N times, since at the beginning of the algorithm the work list contains all nodes in the program). The algorithm again passes through a node's operand list in line 5 of algorithm 4.1 when determining whether all operands represent the same constant value, which again is limited by O.
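Putting the pieces together, this C++ sketch mirrors algorithms 4.1 and 4.2 on the simplified node structure from section 2. The helpers isConst, isPhi, constValue, evaluate and makeConst are assumptions for illustration; in particular, evaluate returns an empty result for operations that must not be folded (cf. the division example above). Duplicate suppression on the work list is elided for brevity.

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <optional>
    #include <vector>

    bool isConst(const Instruction* n);               // assumed type test
    bool isPhi(const Instruction* n);                 // assumed type test
    int64_t constValue(const Instruction* n);         // value of a CONSTInst
    std::optional<int64_t> evaluate(const Instruction* n); // static evaluation
    Instruction* makeConst(int64_t value);            // creates a CONSTInst

    void foldAndPropagate(const std::vector<Instruction*>& graph) {
        std::deque<Instruction*> worklist(graph.begin(), graph.end());

        // Algorithm 4.2: replace n by a new CONSTInst with value c.
        auto replaceByConstant = [&](Instruction* n, int64_t c) {
            Instruction* cnst = makeConst(c);
            for (Instruction* user : n->users) {
                for (Instruction*& op : user->operands)  // line 5
                    if (op == n) op = cnst;
                cnst->users.push_back(user);
                worklist.push_back(user);                // line 6: reconsider
            }
            n->users.clear();  // n is now dead in the sense of definition 3.2
        };

        while (!worklist.empty()) {                      // algorithm 4.1
            Instruction* n = worklist.front();
            worklist.pop_front();
            auto constOp = [](Instruction* o) { return isConst(o); };
            if (n->operands.empty() ||
                !std::all_of(n->operands.begin(), n->operands.end(), constOp))
                continue;                                // line 4 fails
            if (isPhi(n)) {                              // lines 5-6
                int64_t c = constValue(n->operands.front());
                auto sameC = [&](Instruction* o) { return constValue(o) == c; };
                if (std::all_of(n->operands.begin(), n->operands.end(), sameC))
                    replaceByConstant(n, c);
            } else if (auto x = evaluate(n)) {           // lines 7-9
                replaceByConstant(n, *x);
            }
        }
    }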

5. Global Value Numbering

The equivalence of programs, or the equivalence of expressions that are part of programs, is undecidable in general [AWZ88]. Despite this impossibility of formulating an algorithm that finds all equivalences in a program, there exist optimization techniques that identify certain sub-classes of these equivalences. One of these techniques is global value numbering, which finds and removes redundant computations.

This optimization discovers redundancies by assigning value numbers to expressions based on their specific operation and the value numbers of their operands. Subsequently, all occurrences of expressions that have been assigned the same value number can be replaced by a single one, thus eliminating redundant computations. As the name already suggests, this whole process is not restricted to the bounds of basic blocks; it is applied in a global manner. There are several ways to realize this optimization, which can generally be categorized into two approaches. The pessimistic approach initially assumes that all expressions are different by assigning distinct value numbers to them. It then tries to find expressions that can be proven to be equivalent and assigns equal value numbers to them. The optimistic approach, on the other hand, initially assumes that all expressions, or specific subsets thereof, are the same and then refines this assumption. The algorithm which we formulated for integration into the compiler follows the optimistic approach, so we will not go into further detail on the pessimistic one.


Basically, global value numbering detects equal variables or expressions based on their congruence, which is a more restrictive property than equality. More specifically, if two expressions are congruent, this also implies that they are equal, but the reverse does not hold in general (an explanation of this relation between the two properties will be given when we describe how to apply these concepts to nodes within our intermediate representation). Roughly speaking, two operations are said to be congruent if they use equal function symbols on congruent operands. With a few exceptions, which will be explained later on, this denotes the main characterization of this notion.

Optimistic global value numbering initially proceeds by creating sets of expressions which are supposed to be possibly congruent. For example, all arithmetical operations that use the same function symbol will be combined in a set. These sets can be referred to as blocks, which are all part of the current partition. As long as it can be shown that some block contains expressions which are not congruent, the according blocks will be split up and the partition will be refined until each of its blocks contains only congruent expressions. When this refinement terminates, the result will be a final partition of blocks of expressions that are known to compute equal values at run-time. Based on this classification, it is then possible to remove redundant computations. The whole process of this approach is outlined in figure 5.1, dividing this optimization into the three steps just described.

[Figure 5.1: A schematic overview of the optimistic global value numbering approach. Step 1: initial partitioning. Step 2: partition refinement. Step 3: elimination of redundancies as result of the final partition. Figure adopted from [Kon04]]

5.1. Application to the Intermediate Representation

The algorithm that we have formulated for application within the compiler follows the optimistic global value numbering approach and is based on the principles described above. Instead of expressions, the blocks now contain nodes. Accordingly, at the end of the partition refinement, all nodes in a block can be replaced by a single congruent one. But before going into detail on the algorithm, the notion of congruence has to be defined in terms of our intermediate representation. We have already given a very abstract characterization of this property, which based the congruence of two operations on the equality of their function symbols and the congruence of their operands. So far, however, this ignores that some operations need special care, like φ-functions or operations without any operands. For example, φ-functions whose operands have all been proven congruent cannot be consolidated in general: an additional aspect has to be taken into account, namely that, to be congruent, they also have to reside in the same basic block. Based on the considerations of Click [Cli95b] and Alpern et al. [AWZ88] we define congruence as follows.

Definition 5.1. Two nodes are congruent if and only if they are identical or one of the following statements holds:

• Both nodes are CONSTInsts and represent the same constant value.

• Both nodes are PHIInsts which are assigned to the same BEGINInst, and for each i it holds that the operands at index i of the nodes' operand lists are congruent.

• Both nodes are of the same type, and for each i it holds that the operands at index i of the nodes' operand lists are congruent, but neither node is of type CONSTInst, PHIInst, BEGINInst or ENDInst, nor does it have an effect on or depend on the global state.

As formulated in definition 5.1, congruence depends not only on the type of the operation but also on the order of the operands. This means that two operations like x + y and y + x would not be considered congruent, even though they produce equal outputs. This example therefore illustrates that equality does not necessarily imply congruence, as already mentioned before.

According to the previous descriptions, the creation of the initial partition, in terms of our intermediate representation, consists of combining possibly congruent nodes into blocks. The generation of these sets is based on several assumptions which are reflected in the following basic rules:

• All CONSTInsts which represent the same constant value are combined into a common block.

• All PHIInsts which are assigned to the same BEGINInst are combined into a common block.

• Each node which has an effect on or depends on the global state goes into a separate block.

• All other nodes which are neither BEGINInsts nor ENDInsts are merged into blocks depending on their node type, i.e., all nodes of the same type share a block.
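These rules (step 1 in figure 5.1) could be realized by grouping nodes under a key, as in the following hedged C++ sketch; all helper functions are assumptions for illustration, not the framework's actual API.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    bool isConst(const Instruction* n);            // CONSTInst test (assumed)
    bool isPhi(const Instruction* n);              // PHIInst test (assumed)
    bool isBeginOrEnd(const Instruction* n);       // BEGINInst/ENDInst test
    bool touchesGlobalState(const Instruction* n); // effect on/depends on state
    int64_t constValue(const Instruction* n);      // value of a CONSTInst
    int phiBlockId(const Instruction* n);          // id of the PHIInst's BEGINInst
    int kindOf(const Instruction* n);              // discriminates node types

    using Block = std::vector<Instruction*>;

    std::vector<Block> initialPartition(const std::vector<Instruction*>& graph) {
        std::vector<Block> partition;
        std::map<std::string, Block> byKey;

        for (Instruction* n : graph) {
            if (isBeginOrEnd(n))
                continue;                      // control-flow units are excluded
            else if (touchesGlobalState(n))
                partition.push_back({n});      // rule 3: a separate block
            else if (isConst(n))               // rule 1: group by constant value
                byKey["const:" + std::to_string(constValue(n))].push_back(n);
            else if (isPhi(n))                 // rule 2: group by BEGINInst
                byKey["phi:" + std::to_string(phiBlockId(n))].push_back(n);
            else                               // rule 4: group by node type
                byKey["kind:" + std::to_string(kindOf(n))].push_back(n);
        }
        for (auto& entry : byKey)
            partition.push_back(std::move(entry.second));
        return partition;
    }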

The subsequent refinement of this partition is comparable to the partitioning performed when minimizing a finite automaton, where states are initially partitioned into final and non-final states. Based on the state transitions in the automaton, the initial partition is improved until it contains only sets of equal states. Hopcroft [Hop71] gave an algorithm which solves this problem in O(k · n · log(n)), where n is the number of states and k is the number of symbols in the input alphabet of the according automaton. Reformulations of the original algorithm presented by Berstel et al. [BBCF10] and Click [Cli95b] form the basis of our approach listed in algorithm 5.1. It corresponds to the second step in figure 5.1 (the third and last step in this schema has already been explained earlier in this section, and the according considerations stay the same when applied to the nodes of the intermediate representation).


Algorithm 5.1 Global Value Numbering – Partition Refinement

1: P ← initial partition
2: W ← ∅
3: for each P ∈ P do
4:     for i = 1, . . . , O do
5:         W ← W ∪ {(P, i)}
6:     end for
7: end for
8: while W is not empty do
9:     remove some tuple (W, i) from W
10:    split_candidates ← ∅
11:    for each node x ∈ W do
12:        for each node y ∈ x.def_use_i do
13:            split_candidates ← split_candidates ∪ {y.block}
14:            y.block.touched ← y.block.touched ∪ {y}
15:        end for
16:    end for
17:    for each partition X in split_candidates do
18:        if |X| ≠ |X.touched| then
19:            Split(X)
20:        end if
21:        X.touched ← ∅
22:    end for
23: end while

The algorithm uses a work list which contains pairs of the form (W, i), where W stands for a block of nodes and i designates an operand index. The initialization of this list is done in the loop starting at line 3 (in the next line and the rest of the algorithm, O stands for the maximum size of any node's operand list). For each block which is part of the initial partition and each operand index in the given range, such a pair is placed on the work list. Each time one of these tuples (W, i) is removed from the list, the algorithm marks each block which contains at least one node whose operand at index i is part of W. For this purpose, all these blocks are gathered in the set split_candidates (the notation x.def_use_i on line 12 designates the set of nodes which use x as their ith operand). Accordingly, for each of these "split candidates" we have to remember all of its nodes whose ith operand is in W, by collecting them in the block's touched-set. They are needed later in line 18 to decide whether this block has to be split up into two separate ones. According to the definition of congruence, splitting has to occur if it does not hold for all nodes in a block that their ith operands are in the same block, namely, if some of these operands are in W and others are not. The work that has to be done when this separation occurs is listed in algorithm 5.2. Here, all of a block's "touched" nodes are moved to a new block. Note that after this operation each of the nodes that remain in the original block is incongruent to all of those that moved to the new one. Splitting a block into two separate ones might cause other blocks to require splitting as well. Therefore, according tuples for the separated blocks have to be added to the work list.


Algorithm 5.2 Pseudo code for block splitting

1: procedure Split(P)
2:     remove all nodes in P.touched from P
3:     move P.touched to a new block P′
4:     P ← P ∪ {P′}
5:     for j = 1, . . . , O do
6:         if (P, j) ∉ W and |P| ≤ |P′| then
7:             W ← W ∪ {(P, j)}
8:         else
9:             W ← W ∪ {(P′, j)}
10:        end if
11:    end for
12: end procedure
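A hedged C++ sketch of the Split procedure follows; GvnBlock and the (block, index) work-list pairs are simplified stand-ins for the structures used in algorithms 5.1 and 5.2. The size comparison in lines 6 to 9 realizes Hopcroft's strategy of reprocessing the smaller half.

    #include <set>
    #include <utility>
    #include <vector>

    struct GvnBlock {
        std::vector<Instruction*> nodes;
        std::vector<Instruction*> touched;  // filled by the refinement loop
    };

    void split(GvnBlock* p,
               std::vector<GvnBlock*>& partition,
               std::set<std::pair<GvnBlock*, int>>& W,
               int O /* maximum operand list size */) {
        // Lines 2-4: move all touched nodes of p into a fresh block p'.
        auto* pNew = new GvnBlock();
        pNew->nodes = std::move(p->touched);
        p->touched.clear();
        for (Instruction* n : pNew->nodes)
            std::erase(p->nodes, n);        // C++20; linear here for brevity
        partition.push_back(pNew);

        // Lines 5-11: enqueue the smaller half unless (p, j) is already pending.
        for (int j = 1; j <= O; ++j) {
            if (!W.count({p, j}) && p->nodes.size() <= pNew->nodes.size())
                W.insert({p, j});
            else
                W.insert({pNew, j});
        }
    }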

Alpern et al. [AWZ88] gave a similar algorithm, based on a reformulation of Hopcroft's original partition refinement [AHU74]. In appendix A we show that it yields an incorrect partitioning.

On the Correctness of the Algorithm

To show that the given global value numbering algorithm is correct, we adapted the original proof given by Hopcroft [Hop71], which can be done easily, as shown in the following. Basically, the following claim has to hold:

Claim 5.1. On termination of the algorithm, two nodes are congruent if and only if they are in the same block.

Proof. First we show that for all nodes x and y (x ≠ y), it holds that x is not congruent to y if x ∈ B and y ∈ C (B, C ∈ P), provided that B ≠ C. This can be done by induction on the number of times lines 2 to 4 in algorithm 5.2 are executed, where a specific block is split up into two separate ones. If the statement holds before the nth execution, it will also be satisfied afterwards, because splitting occurs only when operand nodes at a given index have previously been shown to be not congruent. Obviously the statement is true after the initial partitioning and hence before the first splitting occurs. From this it follows that two nodes are in the same block at the end of the algorithm if they are congruent.

To prove the second part of the claim, we show that two nodes that are not congruent cannot be in the same block at termination of the algorithm. For this we introduce a function δ : G × {1, . . . , O}^n → G, where G is the set of nodes in the program, O is the maximum size of any node's operand list and n ∈ ℕ (in fact our definition of δ is based on the transition function used in automata theory, which for a certain state-word pair gives the according successor state). For any node x ∈ G and any n-tuple (o1, . . . , on) ∈ {1, . . . , O}^n, δ(x, (o1, . . . , on)) denotes the node that is reached by following the use-def edges at the operand indices given by (o1, . . . , on), starting at x. If n = 1, so that (o1, . . . , on) collapses to a one-dimensional tuple (o1), we abbreviate δ(x, (o1)) by δ(x, o).
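In LaTeX notation, δ can equivalently be stated recursively, the base case following a single use-def edge and the general case composing such steps:

    \[
      \delta(x, (o_1)) = \text{the node reached from } x
        \text{ via the use-def edge at operand index } o_1,
    \]
    \[
      \delta\bigl(x, (o_1, \dots, o_n)\bigr)
        = \delta\bigl(\delta(x, o_1), (o_2, \dots, o_n)\bigr).
    \]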


Assume two nodes x and y which are contained in the same block B ∈ P but are not congruent. Without loss of generality, assume δ(x, o) ∈ C and δ(y, o) ∈ D (C, D ∈ P). Now there are two possibilities to be taken into account:

C ≠ D: When the algorithm reached the point where δ(x, o) and δ(y, o) first appeared in two distinct blocks C and D, at least one of the tuples (C, o) and (D, o) was placed on W (if both C and D had been part of the initial partitioning, both tuples would have been added to the work list; otherwise just one of them would have been placed on W, namely the smaller one). When either (C, o) or (D, o) is removed from the work list, the algorithm will split the block that contains x and y, so that x and y are placed in two distinct sub-blocks. Obviously this is inconsistent with the initial assumption that both nodes are in the same block.

C = D: This case can be reduced to the C ≠ D case, as shown in the following. Based on the assumption that x and y are not congruent, there has to exist an n-tuple (p1, . . . , pn) for some n so that δ(x, (p1, . . . , pn)) and δ(y, (p1, . . . , pn)) are in distinct blocks. From this it follows that there has to be such an n-tuple (p1, . . . , pn) for a smallest n. Now let o be the last element of the according n-tuple, namely, let o = pn. It follows that δ(x, (p1, . . . , pn−1)) and δ(y, (p1, . . . , pn−1)) are not congruent, and δ(δ(x, (p1, . . . , pn−1)), o) and δ(δ(y, (p1, . . . , pn−1)), o) are in distinct blocks. Now we can replace x by δ(x, (p1, . . . , pn−1)) and y by δ(y, (p1, . . . , pn−1)).

6. Evaluation

For evaluating the implementation of the described techniques, we tested the compiler with activated optimizations as well as with deactivated optimizations (in the following we refer to the new compiler as compiler2). Additionally, we did a comparison to the baseline compiler.

6.1. Methodology

Test Environment

The system used for performing the tests comprised a 2.4 GHz Intel® Core™ i5 dual-core processor and 4 GB RAM, running Linux in 64-bit mode with a 3.11.0-15 kernel.

Configuration of the CACAO VM

For building the VM, an LLVM Clang compiler (version 3.2-7ubuntu1) was used with optimizations enabled. CACAO was configured with the following options: --disable-debug, --enable-compiler2, --enable-statistics, --enable-rt-timing, --enable-logging and --enable-optimizations.


Benchmarks

At the time of the evaluation, the new compiler did not support advanced benchmark suites like SPECjvm or DaCapo. Therefore, it was necessary to use a set of micro-benchmarks targeted at testing only certain parts of the compiler. First of all, the benchmarking programs already employed by Eisl [Eis13] were used. To systematically test the implemented optimization techniques, additional benchmarks were applied. The effects of constant folding and constant propagation have been evaluated based on the following programs:

constArith contains constant arithmetical expressions aimed at being evaluated by constant folding.

constPhi has a nested control-flow structure, so that translation to SSA form leads to φ-functions which should be replaced in the course of constant propagation.

The benchmarks for testing global value numbering contain different kinds of redundancies which have to be removed when applying this optimization. They are listed in the following:

congrArith involves congruent arithmetical operations.

congrPhi is organized in such a way that, when translated to SSA form, the program contains congruent φ-functions which have to be replaced in the course of optimization.

congrArraybc repeatedly accesses arrays at certain positions, involving redundant array bounds-checks which should also be removed by global value numbering.

We did not formulate benchmarking programs to explicitly test the effects of dead code elimination, for two reasons. First, at the time of the evaluation the new compiler did not support the compilation of dead code. Second, constant folding, constant propagation and global value numbering produce dead code anyway, so dead code elimination is tested implicitly whenever the other techniques can realize optimizations.

Key Figures

The comparison of the benchmarks is centered on the size of the code produced by the compilers, the time needed for compilation and the execution time of the compiled benchmark code. The effects of constant folding and constant propagation have been measured by counting the number of nodes representing constant expressions which could be evaluated at compile-time, as well as the number of PHIInsts that could be replaced by constants. For the optimizations achieved by global value numbering, the number of detected redundant nodes, which are removed from the program, is representative. Therefore, these redundancies have been recorded in terms of four categories of nodes: CONSTInsts, nodes of an arithmetical type, PHIInsts and nodes which represent array bounds-checks (i.e., ARRAYBOUNDSCHECKInsts). Additionally, the number of nodes removed from the program in the course of dead code elimination has been counted, as well as the number of nodes remaining in the program.


6.2. Results

The code size of the compiled benchmarking programs is illustrated in figure 6.1. As expected, the application of the optimizations clearly yields less code for constArith, constPhi, congrArith, congrPhi and congrArraybc than compilation without applying these techniques. Fortunately, the same is true for the original benchmarks used by Eisl, which generally were expected to offer considerably less potential for optimization. In many cases the output size of the baseline compiler is even greater than that of the new compiler with inactive optimizations. Only for one benchmark (namely conv) does the baseline compiler yield less code than the optimized version of the new compiler. The exact recordings of the code size are listed in table 6.1. According to tables 6.5 and 6.6, in almost all cases the decrease in code size can be attributed to global value numbering. Only constArith and constPhi could be optimized by constant folding or constant propagation respectively.

Table 6.7 shows the number of dead nodes deleted in the course of dead code elimination, as well as the number of nodes remaining in the program. Since none of the benchmarking programs originally contains any dead code, the according numbers solely depend on the modifications of the intermediate representation graph applied by constant folding, constant propagation or global value numbering. Nevertheless, these numbers have been included for the sake of completeness.

As illustrated in figure 6.2, the decline in the size of the produced code yields faster execution in general. Nevertheless, the execution time of matMult is noticeable since, despite its size, the native code produced by the baseline compiler is clearly more efficient than that of the new compiler, even with the implemented optimization techniques enabled.

Figure 6.3 depicts the time needed to compile the benchmarks. In some cases the optimizations reduce the compilation efforts of the new compiler, which causes faster compilation. On average, however, the implemented optimizations lead to a slight increase in the running time of the new compiler.


[Figure 6.1: Code size (bytes) of the compiled benchmarks (fact, piSpigot, power, matMult, matAdd, matTrans, conv, permut, constArith, constPhi, congrArith, congrPhi, congrArraybc) for compiler2, optimized compiler2 and the baseline compiler]

[Figure 6.2: Execution time (µsec) of the compiled benchmarks for compiler2, optimized compiler2 and the baseline compiler]


[Figure 6.3: Compilation time (µsec) of the benchmarks for compiler2, optimized compiler2 and the baseline compiler]

Benchmark      c2     c2o    bl     c2o/c2 (%)   c2o/bl (%)
fact           60     52     80     86.7         65.0
piSpigot       374    300    344    80.2         87.2
power          58     55     72     94.5         76.1
matMult        569    528    608    92.8         86.9
matAdd         472    449    520    95.1         86.4
matTrans       915    836    896    91.4         93.3
conv           1003   949    928    94.7         102.3
permut         520    357    520    68.6         68.7
constArith     74     12     136    16.2         8.8
constPhi       205    132    200    64.3         65.9
congrArith     164    76     264    46.2         28.7
congrPhi       173    128    168    74.2         76.5
congrArraybc   244    133    264    54.5         50.4

Table 6.1 Comparison of the size (bytes) of compiled benchmark code produced by compiler2 (c2), compiler2 with activated optimizations (c2o) and the baseline compiler (bl)


Benchmark      c2      c2o     bl      c2o/c2 (%)   c2o/bl (%)
fact           3045    2512    3632    82.5         69.2
piSpigot       6490    4376    5431    67.4         80.6
power          3487    2841    4608    81.5         61.7
matMult        10003   9481    5424    94.8         174.8
matAdd         3758    3936    4950    104.7        79.5
matTrans       9288    9080    9517    97.8         95.4
conv           12536   9971    10979   79.5         90.8
permut         9851    6491    8242    65.9         78.8
constArith     2189    1779    3146    81.2         56.5
constPhi       2407    2079    3074    86.4         67.6
congrArith     2413    2061    2539    85.4         81.2
congrPhi       1814    1451    1602    80.0         90.6
congrArraybc   3465    2430    4034    70.1         60.2

Table 6.2 Comparison of the execution time (µsec) of compiled benchmark code produced by compiler2 (c2), compiler2 with activated optimizations (c2o) and the baseline compiler (bl)

Benchmark      c2      c2o     bl     c2o/c2 (%)   c2o/bl (%)
fact           1141    1632    66     143.1        2488.5
piSpigot       3221    2560    82     79.5         3129.6
power          1275    1569    67     123.0        2327.1
matMult        4150    4226    164    101.8        2582.6
matAdd         3493    3377    135    96.7         2496.3
matTrans       5876    6324    182    107.6        3467.0
conv           7253    6755    154    93.1         4398.7
permut         3139    3044    117    97.0         2593.8
constArith     1165    1225    67     105.1        1836.1
constPhi       2668    2451    83     91.9         2955.2
congrArith     1683    1531    65     91.0         2355.1
congrPhi       2133    2150    81     100.8        2648.3
congrArraybc   1425    1371    64     96.2         2132.3

Table 6.3 Comparison of the compilation time (µsec) of compiler2 (c2), compiler2 with activated optimizations (c2o) and the baseline compiler (bl)


Benchmark      de     cfp    gvn    de/c2o (%)   cfp/c2o (%)   gvn/c2o (%)
fact           18     11     44     1.1          0.7           2.7
piSpigot       55     25     147    2.2          1.0           5.7
power          16     10     45     1.0          0.6           2.9
matMult        55     30     167    1.3          0.7           3.9
matAdd         50     25     130    1.5          0.7           3.9
matTrans       114    53     289    1.8          0.8           4.6
conv           110    55     282    1.6          0.8           4.2
permut         60     25     144    2.0          0.8           4.7
constArith     40     24     107    3.3          2.0           8.7
constPhi       70     38     108    2.9          1.6           4.4
congrArith     40     18     114    2.6          1.1           7.4
congrPhi       52     23     104    2.4          1.1           4.8
congrArraybc   27     16     92     2.0          1.2           6.7

Table 6.4 Comparison of the time (µsec) spent during dead code elimination (de), constant folding/propagation (cfp) and global value numbering (gvn), as absolute numbers and as ratio of the total compilation time

Benchmark      arithmetic   PHIInst
fact           0            0
piSpigot       0            0
power          0            0
matMult        0            0
matAdd         0            0
matTrans       0            0
conv           0            0
permut         0            0
constArith     13           0
constPhi       4            5
congrArith     0            0
congrPhi       0            0
congrArraybc   0            0

Table 6.5 Nodes that could be replaced by CONSTInsts in the course of constant folding and constant propagation


               CONSTInst            arithmetic
Benchmark      total   redundant    total   redundant
fact           3       2            2       0
piSpigot       20      9            21      3
power          3       1            2       0
matMult        12      10           5       0
matAdd         9       7            3       0
matTrans       17      15           10      0
conv           19      17           11      0
permut         12      10           4       0
constArith     17      5            13      0
constPhi       25      17           5       2
congrArith     16      12           23      8
congrPhi       15      10           13      6
congrArraybc   8       6            0       0

               PHIInst              array bounds-ch.
Benchmark      total   redundant    total   redundant
fact           2       0            0       0
piSpigot       4       0            0       0
power          2       0            0       0
matMult        4       0            9       0
matAdd         2       0            9       0
matTrans       7       0            16      3
conv           8       0            16      1
permut         4       0            12      6
constArith     0       0            0       0
constPhi       10      0            0       0
congrArith     0       0            0       0
congrPhi       6       2            0       0
congrArraybc   0       0            8       4

Table 6.6 Nodes detected as redundant by global value numbering


Benchmark      deleted   remaining
fact           2         14
piSpigot       12        53
power          1         16
matMult        10        76
matAdd         7         68
matTrans       18        107
conv           18        117
permut         16        57
constArith     29        3
constPhi       26        47
congrArith     20        22
congrPhi       18        37
congrArraybc   10        18

Table 6.7 Dead nodes deleted during dead code elimination and remaining nodes in the program

7. Related Work

Section 4 introduced a simple algorithm which combines constant folding and constant propagation. It discovers values which are statically known to be constant and uses this information to evaluate expressions at compile-time. Nevertheless, this optimization is based on assumptions which restrict the number of detected constant expressions. Basically, it supposes that all instructions in the program can be reached; for an improvement of the optimization results, this assumption has to be dropped. What has to be considered is the evaluation of constant conditions. Obviously, if the result of a condition test can be computed at compile-time, it is possible to gain information about which branches will never be executed at run-time. These unreachable code sections can then be ignored by further optimization and deleted from the program, which is referred to as unreachable code elimination [AP03]. Furthermore, the removal of conditional branches can have another positive effect, namely for the propagation of constants: the search for constants can now be restricted to the reachable branches, which possibly yields propagation possibilities that have not been discovered before. Further propagation can in turn raise additional potential regarding the evaluation of condition tests, which again could lead to the deletion of branches.
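For illustration, consider the following contrived C++ fragment (not taken from the benchmark set). The condition folds to false at compile-time, making the then-branch unreachable; after its removal, x has only one reaching definition and the return value folds to a constant.

    int expensiveDiagnostics();  // assumed helper, never executed here

    int g() {
        const int debug = 0;
        int x;
        if (debug != 0) {                // condition is statically false
            x = expensiveDiagnostics();  // unreachable: can be deleted
        } else {
            x = 1;                       // the only reachable definition of x
        }
        return x + 1;                    // folds to the constant 2
    }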

An approach that combines constant propagation and unreachable code elimination is followed by a technique called conditional constant propagation [WZ91]. In contrast to the algorithm presented in section 4, it propagates values only to those code parts which are already known to be reachable. That means, based on symbolic execution, this optimization starts at the beginning of the program and continues by exploring which further code sections can be reached, where constant propagation will proceed. Wegman and Zadeck [WZ91] also give a formulation of this optimization which takes advantage of static single assignment form, called sparse conditional constant propagation.


Another field that offers techniques yielding similar optimization results concerns the removal of redundancies in programs. This involves the discovery of equivalent expressions which, as mentioned in section 5, is known to be an undecidable problem. Besides the presented formulation of global value numbering, which discovers certain equivalences based on the congruence of expressions, there exist a number of alternative techniques that strive to find preferably large portions of redundant computations. One of those techniques is referred to as partial redundancy elimination which, in contrast to global value numbering, follows a lexical approach. This means it identifies expressions that are textually congruent, allowing the discovery of redundant computations which are not necessarily value-congruent [Cli95b]. The scope of this optimization spans the detection and removal of redundancies which occur along some, but not necessarily all, paths through a program. These are also referred to as partial redundancies [ALSU06]. More generally, it also detects common sub-expressions and loop-invariant code, which can be viewed as special types of partial redundancies. Once this optimization has discovered redundant computations, it deletes or relocates the according expressions, so that they are evaluated only as often as necessary during execution of the program.

In general, the sets of redundancies found by global value numbering and partial redundancy elimination overlap; however, the two optimizations do not find exactly the same equivalences. According to Click [Cli95a], in practice global value numbering can be expected to find more redundancies than partial redundancy elimination. Approaches exist that try to combine both techniques to maximize the number of detected redundant expressions [Van04].

8. Conclusions

In this thesis we presented machine-independent optimizations for use within the new optimizing compiler of the CACAO VM. Dead code elimination, constant folding, constant propagation and global value numbering have been described, and it has been shown how they can be applied to the SSA-based intermediate representation of the compiler. Due to the characteristics of this intermediate representation, constant propagation becomes an implicit effect of constant folding, hence both techniques can be combined in a single compiler pass. The program transformations performed by constant folding, constant propagation and global value numbering inherently lead to dead code and thus have to be followed by a run of dead code elimination.

Based on our examinations the optimizations have been implemented and added to the compiler. Thereby, both the code size and the execution time of compiled programs could be decreased, which can mainly be attributed to global value numbering. Though compilation time has slightly increased on average, in some cases the optimizations also achieve faster compilation, since less work has to be done within subsequent compiler passes.

A. Alpern et al.’s Partitioning Algorithm

In section 5.1 we presented an algorithm for the partition refinement process of global value numbering. Alpern et al. [AWZ88] presented a corresponding algorithm, which is also inspired by Hopcroft's algorithm for minimizing the number of states of a finite automaton [Hop71]. In fact, they do not use the original version of this algorithm; instead they base their work on a reformulation given by Aho et al. [AHU74].

In their explanations Alpern et al. use an intermediate representation referred to as a value graph, which is very similar to the one used by CACAO's new compiler framework. Basically, expressions or operations are represented by nodes, and the dependencies between the corresponding values are modeled by edges. Each node is labeled with a function symbol describing the kind of operation it represents. Two nodes in this graph are said to be congruent if the following prerequisites hold:

• The nodes have identical function labels.

• The corresponding destinations of the edges leaving the nodes are congruent.

Furthermore, the term congruence is defined as the maximal fixed point which satisfies these conditions.

Algorithm A.1 Alpern et al.’s partitioning algorithm

 1: WAITING ← {1, 2, …, p}
 2: q ← p
 3: while WAITING ≠ ∅ do
 4:     select and delete an integer i from WAITING
 5:     for m from 1 to k do
 6:         INVERSE ← ∅
 7:         for x in B[i] do
 8:             INVERSE ← INVERSE ∪ F⁻¹[m, x]
 9:         end for
10:         for each j such that B[j] ∩ INVERSE ≠ ∅ and B[j] ⊈ INVERSE do
11:             q ← q + 1
12:             create a new block B[q]
13:             B[q] ← B[j] ∩ INVERSE
14:             B[j] ← B[j] − B[q]
15:             if j is in WAITING then
16:                 add q to WAITING
17:             else
18:                 if |B[j]| ≤ |B[q]| then
19:                     add j to WAITING
20:                 else
21:                     add q to WAITING
22:                 end if
23:             end if
24:         end for
25:     end for
26: end while

Based on these considerations it is now possible to create an initial partitioning of the nodes into blocks. According to the process described in section 5.1, at the beginning of this optimization all nodes with identical function labels are assumed to be congruent and are therefore put in the same block. The subsequent refinement of these blocks, as formulated by Alpern et al., is presented in algorithm A.1. It takes as input the initial partitioning of the nodes into blocks and a set of functions fᵢ, where fᵢ(x) denotes the node that serves as the i-th operand of x. The partitioning of the nodes is represented by a vector denoted by B, which contains blocks of nodes. Instead of blocks, the work list WAITING holds indices which refer to the corresponding elements of B. Furthermore, the algorithm uses a mapping F⁻¹ so that F⁻¹[m, x] denotes the set of nodes that use x as their m-th operand, i.e. F⁻¹[m, x] represents the inverse image of node x under fₘ.
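For concreteness, the following is a direct Python transcription of algorithm A.1 (an experimental sketch, not code from the CACAO compiler; the FIFO selection order and the dictionary encodings of B and F⁻¹ are our own assumptions, since the pseudocode leaves the selection at line 4 unspecified). It deliberately preserves the flaw discussed below:

    def partition(B, p, k, finv):
        # B: dict mapping block indices 1..p to sets of nodes.
        # k: maximal number of operands of any node.
        # finv[(m, x)]: set of nodes using node x as their m-th operand,
        #               i.e. F^-1[m, x]; missing keys mean the empty set.
        waiting = list(range(1, p + 1))            # line 1
        q = p                                      # line 2
        while waiting:                             # line 3
            i = waiting.pop(0)                     # line 4 (FIFO order assumed)
            for m in range(1, k + 1):              # line 5
                inverse = set()                    # line 6
                for x in B[i]:                     # lines 7-9
                    inverse |= finv.get((m, x), set())
                # line 10: snapshot of all blocks that INVERSE properly cuts
                cut = [j for j in list(B)
                       if B[j] & inverse and not B[j] <= inverse]
                for j in cut:
                    q += 1                         # lines 11-12
                    B[q] = B[j] & inverse          # line 13
                    B[j] = B[j] - B[q]             # line 14
                    if j in waiting:               # lines 15-16
                        waiting.append(q)
                    elif len(B[j]) <= len(B[q]):   # lines 18-19
                        waiting.append(j)
                    else:                          # lines 20-21
                        waiting.append(q)
        return B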

Problems arise with this algorithm when an integer i is picked from the work list such that, at some iteration of the loop starting at line 5, the condition B[j] ∩ INVERSE ≠ ∅ and B[j] ⊈ INVERSE at line 10 is true for j = i. Such situations can cause non-congruent nodes to reside in the same block at termination of the algorithm.

[Figure A.1: Value graph of two independent expressions. Left sub-graph: n1 = 5 + n2, n2 = n3 ∗ 5, n3 = 5 ∗ 5. Right sub-graph: n4 = 5 + n5, n5 = n6 / 5, n6 = 5 / 5. All leaves are constant nodes with value 5.]

Figure A.1 shows a value graph which comprises two sub-graphs representing two independent expressions (the example is illustrated based on the notation used by Alpern et al. [AWZ88], where data dependencies are modeled by arrows pointing in use-def direction). Obviously, nodes n1 and n4 are not congruent, because their operands n2 and n5 have distinct function labels. Accordingly, both nodes should be placed in different blocks at the end of the algorithm. We will show in the following that there exists a possible execution sequence such that n1 and n4 will reside in the same block. Assume that after the initial partitioning, the blocks are defined as follows:

B[1] = {n2, n3}
B[2] = {n5, n6}
B[3] = {n1, n4}
B[4] = nodes with constant value 5

At the beginning of the algorithm, WAITING is initialized to {1, 2, 3, 4}. Without loss of generality, assume that at the first iteration of the while loop, the integer 1 is picked from WAITING. The algorithm continues at the loop in line 5 for m = 1 and computes the union of the inverse images of all nodes in B[1] under f₁, leading to INVERSE = {n2}. Obviously, at line 10, the condition in the loop header is true only for j = 1. Accordingly, a new block B[5] = {n2} will be created and the node n2 is removed from B[1]. Since 1 is no longer in the WAITING set and |B[1]| ≤ |B[5]|, 1 will be added to WAITING again (see lines 15 et seqq.). At the next iteration of the outermost for-loop (for m = 2), the set B[1] only contains the node n3, thus no changes to the blocks will happen.

For the next iteration of the while loop, we assume, without loss of generality, that the integer 2 is selected from the work list. The following execution of the loop body proceeds similarly to the previous iteration described above. Accordingly, a new block B[6] = {n5} will be created and B[2] will be changed to only contain n6. Furthermore, 2 will be added to WAITING, which will then contain 1, 2, 3 and 4.

In subsequent iterations, the algorithm will select and delete all the remaining indices in WAITING, but no further changes to the blocks will be applied. Thus, at termination, B will contain the following blocks:

B[1] = {n3}
B[2] = {n6}
B[3] = {n1, n4}
B[4] = nodes with constant value 5
B[5] = {n2}
B[6] = {n5}

As claimed above, the nodes n1 and n4 are contained in one common block. Thus, the algorithm yields an incorrect partitioning.
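Running the transcription given earlier in this appendix on a reconstruction of figure A.1 reproduces this behaviour. The leaf names c1 to c8 for the eight constant-5 nodes are our own illustrative labels:

    ops = {                       # node -> (first operand, second operand)
        'n3': ('c1', 'c2'), 'n2': ('n3', 'c3'), 'n1': ('c4', 'n2'),
        'n6': ('c5', 'c6'), 'n5': ('n6', 'c7'), 'n4': ('c8', 'n5'),
    }
    finv = {}                     # build F^-1 from the operand table
    for node, operands in ops.items():
        for m, x in enumerate(operands, start=1):
            finv.setdefault((m, x), set()).add(node)

    B = {1: {'n2', 'n3'},                          # label *
         2: {'n5', 'n6'},                          # label /
         3: {'n1', 'n4'},                          # label +
         4: {f'c{i}' for i in range(1, 9)}}        # constant-5 leaves

    print(partition(B, p=4, k=2, finv=finv))
    # B[3] still holds both n1 and n4, although their second operands n2
    # and n5 end up in different blocks: an incorrect partitioning.

With the assumed FIFO order, the indices are picked exactly in the sequence described above (1, 2, 3, 4, 1, 2), and the run terminates with n1 and n4 in one common block.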

References

[AHU74] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1974.

[ALSU06] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 2006.

[AP03] Andrew W. Appel and Jens Palsberg. Modern Compiler Implementation in Java. Cambridge University Press, New York, NY, USA, 2nd edition, 2003.

[AWZ88] B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting Equality of Variables in Programs. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '88, pages 1–11, New York, NY, USA, 1988. ACM.

[BBCF10] Jean Berstel, Luc Boasson, Olivier Carton, and Isabelle Fagnot. Minimization of Automata. CoRR, abs/1010.5318, 2010.

[Cli95a] Cliff Click. Global Code Motion/Global Value Numbering. SIGPLAN Not., 30(6):246–257, June 1995.

[Cli95b] Clifford Noel Click, Jr. Combining Analyses, Combining Optimizations. PhD thesis, Houston, TX, USA, 1995.

[Eis13] Josef Eisl. Optimization Framework for the CACAO VM. Master's thesis, Vienna University of Technology, 2013.

[Hop71] John E. Hopcroft. An N Log N Algorithm for Minimizing States in a Finite Automaton. Technical report, Stanford, CA, USA, 1971.

[Kon04] Marten Kongstad. An Implementation of Global Value Numbering in the GNU Compiler Collection with Performance Measurements, 2004.

[Muc97] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.

[Van04] Thomas John Vandrunen. Partial Redundancy Elimination for Global Value Numbering. PhD thesis, West Lafayette, IN, USA, 2004.

[WZ91] Mark N. Wegman and F. Kenneth Zadeck. Constant Propagation with Conditional Branches. ACM Trans. Program. Lang. Syst., 13(2):181–210, April 1991.