
Implementation of Partial Redundancy Elimination (PRE) for MLton Compiler

Student: Srinath Kanna Dhandapani | Advisor: Dr. Matthew Fluet

1. Abstract The goal of this project is to implement the GVN-PRE [1] optimization pass in MLton to eliminate partial redundancies. With this pass integrated into MLton, all full redundancies should be eliminated automatically, and loop-invariant code motion is performed as well. The same pass can be extended to eliminate redundancies involving constructors and tuples, which carry higher register loads, in addition to simple expressions. This optimization can also enable further improvements in the resulting code, such as eliminating code segments based on known cases.

2. Motivation Partial redundancy elimination is one of the important optimization passes in a compiler. The approach proposed in VanDrunen's work, based on Global Value Numbering [1], is straightforward compared to other algorithms such as SSAPRE [2], which is the main motivation for testing this pass in MLton. The runtime improvements of up to three times reported for GVN-PRE [1] show the significance of eliminating partial redundancies in a program. Since SSA is the intermediate representation for which GVN-PRE is proposed, the algorithm can be applied directly to MLton's SSA with minor tweaks. The main advantage of this implementation is the possibility of extending the same approach to eliminate large data-structure expressions, since they are immutable in Standard ML.

3. Introduction

3.1 MLton MLton [3] is a whole-program optimizing compiler for the language Standard ML. It has among the best run-time performance of comparable compilers. It optimizes the program with complete information about all the files in scope and reduces the code as much as it can. MLton is widely used both in academia and in industry. Besides the compiler, MLton also offers its Basis Library, which provides many everyday functions, and MLton itself is written using these libraries without relying on third-party sources. This 20-year-old open-source compiler is mature in reporting errors in an understandable manner so that programmers can resolve them.

3.2 Optimization Optimization is a phase in the compiler where the intermediate representation (IR) is transformed to improve the efficiency of the target program in terms of both runtime and code size. This involves reducing the number of instructions by eliminating some statements and representing the code in a more efficient way. The most important rule of optimization is that it must not change the meaning of the


code written by the programmer. Optimization matters not because the programmer is unable to write optimized code with fewer instructions, but because at the IR level various further optimizations can be performed that are out of scope from the programmer's point of view. Optimization in MLton involves various passes, such as commonSubexp, each handling optimization according to specific criteria. The most important pass with respect to the PRE implementation is commonSubexp.

3.3 CommonSubexp The Common Subexpression Elimination (CSE) optimization pass in MLton, based on Global Value Numbering (GVN), eliminates fully redundant expressions based on value equivalence. If the statements in Fig 2 are part of the code to be optimized, then the computation of B + C is the same as A + B, which is already stored in X. The optimized code can eliminate the instruction storing A's value in C and instead assign the value of X to Y, as shown in Fig 3. Here the expressions A + B and B + C both evaluate to the same value, and this kind of equivalence between expressions that compute the same value is known as value equivalence.

Fig 2: Original Code          Fig 3: Optimized Code

    X = A + B                     X = A + B
    C = A                         Y = X
    Y = B + C

Fig 4a: Partial Redundancy Fig 4b: Full Redundancy not satisfying the dominance condition CSE cannot optimize the code in Fig 4a. Even though the computation of the expression A + B is redundant, it is not available along both control-flow paths, as shown in Fig 4a. Expressions that are available along all control-flow paths are called fully redundant. Only those full redundancies where the expression is



dominated by another equivalent expression in the dominator tree are valid for elimination by CSE. This is not the case in Fig 4b: the redundant expressions are at the same level in the dominator tree even though they are available along both control-flow paths. The major drawback of this pass, beyond its scope, is code motion. Code motion can make code globally available to the blocks that commonly use a particular set of statements, benefiting them without lengthening any control path.

3.4 Loop Invariant Code Motion Another important optimization is loop-invariant code motion: moving code whose value does not change across iterations out of the loop. This reduces register pressure and avoids repeated computations within loops. It can be understood with a variable assignment in a "for loop" as shown in Fig 5. As the assigned value does not change throughout the execution of the loop, it is better to move the assignment outside the loop, computing it only once and making it available throughout the loop's execution.

Fig 5: Example for Loop Invariant Code Motion
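As a rough Standard ML sketch of the same idea (the function names and loop bound are invented for illustration, not taken from the report): the invariant product m * n is recomputed on every iteration before the transformation, and computed once after it.

    fun sumBefore (m, n) =
        let
            fun loop (i, acc) =
                if i >= 10
                then acc
                else loop (i + 1, acc + m * n)  (* m * n recomputed each iteration *)
        in
            loop (0, 0)
        end

    fun sumAfter (m, n) =
        let
            val k = m * n                       (* hoisted: invariant across iterations *)
            fun loop (i, acc) =
                if i >= 10 then acc else loop (i + 1, acc + k)
        in
            loop (0, 0)
        end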

3.5 Partial Redundancy Elimination

Partially redundant expressions are those that are redundant along some of the control paths but not all of them, as arises in control-flow statements like "case" or "if-else". The example in Fig 4a shows partial redundancy in an "if-else" statement. GVN-PRE is proven to perform CSE and loop-invariant code motion along with eliminating partial redundancy. It catches not only value equivalences but also lexical equivalences. According to VanDrunen's work [1], the proposed hybrid implementation of GVN-PRE, taking the advantages of both global value numbering and partial redundancy elimination, not only outperforms commonSubexp but is also an easy approach to implementing PRE, unlike the work that influenced it, SSAPRE [2]. The way PRE eliminates partial redundancy can be understood as follows: it first identifies where and how the redundancy is present; it then converts the partial redundancies into full redundancies by inserting the expressions into the blocks that miss them; finally, the full redundancies are eliminated using the Global Value Numbering technique. Fig 6 shows how GVN-PRE performs common subexpression elimination, loop-invariant code motion, and partial redundancy elimination.


Fig 6a: Original Code Fig 6b: Optimized Code with GVN-PRE

3.6 SSA Intermediate Representation (IR) SSA, Static Single Assignment form, is an intermediate representation where all complex statements are converted to simple statements; for example, loops and case statements are expressed using "if-else" statements. All variables in SSA form are unique, and variables whose scope spans many blocks are represented with different version numbers appended to them. If such variables merge from different control paths into a block, a phi statement is introduced in the merging block, taking the versioned variable names from the predecessors as arguments. MLton uses SSA form for its optimizations after type checking, but its form is slightly different. Instead of phi statements, it uses goto transfers: the versioned variables used in the predecessors are passed as arguments, and the block where the predecessors meet receives the variable in its block arguments. The block argument automatically takes the version of the variable from whichever predecessor the control path came from. Further differences between the SSA used in GVN-PRE [1] and that of MLton are discussed in the following sections. Fig 7 shows how a sample program is translated from its CFG to SSA form.


Fig 7a: Standard ML Code Fig 7b: In CFG Form Fig 7c: In SSA Form

3.7 Dominator Tree

Fig 9a: SSA Fig 9b: Dominator Tree The dominator tree is a tree representation of all the blocks in the control-flow graph based on their dominance. Fig 9b shows the dominator tree for the SSA in Fig 9a. A block dominates another block if every control-flow path from the entry to the latter must pass through the former. The dominator tree plays a vital role in converting a CFG to its SSA form: as explained in section 3.6, it is used to place phi statements in the SSA. It is also used to find loops in the CFG. The GVN-PRE [1] algorithm is based entirely on traversals of the dominator tree, in top-down order and in post-order.


4. Related Work 4.1 Partial Redundancy Elimination for Global Value Numbering

This paper [1] proposes three separate methods to implement PRE:

• ASSAPRE – Anticipation-based SSAPRE, an extension of SSAPRE [2].

• Value-based Partial Redundancy Elimination (GVN-PRE).

• Load elimination enhanced with Partial Redundancy Elimination (LEPRE).

ASSAPRE is based on the corner cases in which SSAPRE [2] fails and on ways to overcome them: the simple approach of removing redundancy in bottom-up rather than top-down order handles the SSA-breaking conditions well. GVN-PRE is a simple way to implement PRE, integrating ASSAPRE and GVN to remove partial redundancies. LEPRE extends GVN-PRE so that the removal of redundancies on simple expressions also handles complex reference objects. This project is based entirely on the second method, GVN-PRE. In this work, VanDrunen defines values as lists of all the expressions that evaluate to the same value, and expressions as expressions in which the variables have been translated to their values. So, to avoid confusion in the following sections: expressions are the ordinary expressions in any SSA form, value expressions are the GVN-PRE versions of expressions, and values are lists of value expressions that share the same value. Value expressions here refer to the kinds of expressions in the program that are considered for elimination by this optimization pass.

Value expressions: v ::= t | V1 op V2

where t is any variable, covering the case of copying the value of one variable to another, and the second form covers complex expressions like A + B that are composed of variables and operators, with the variables translated to their corresponding values, giving VA + VB.

Like CSE, this algorithm is value based, so a global hash table is created with the value expression as the KEY and its value as the hash-table VALUE. This table holds the knowledge of all the variables and expressions that are value equivalent. For example, suppose the variables X and Y both have the value 1. Then X is inserted with its VALUE as {VX, VY, V1}, and an entry for Y is also created referencing the same VALUE as X. This way the table grows by building on previous entries, holding one instance of the VALUE for the set of all KEYS sharing the same value. The operations available on this global table are:

o lookup () – checks for an entry for a value expression and returns its corresponding VALUE if present.

o lookup_or_add () – similar to lookup, but instead of returning NULL when the entry is absent, it inserts a new entry for the value expression.


o add () – inserts a new entry for the value expression. A new entry is created with an empty set as the VALUE, to which the value expression itself is added as self-value knowledge.

The power and efficiency of the PRE algorithm are directly tied to the effectiveness of this table, for example through added support for algebraic simplifications and commutativity.
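As a rough illustration (not from the paper), the three operations can be given the following Standard ML signature; vexp and value are placeholder types standing in for the concrete VExp.t and VExp.value defined in the Design section below:

    (* Placeholder types for the sketch only. *)
    type vexp = string
    type value = vexp list ref

    signature VAL_TABLE =
        sig
            val lookup : vexp -> value option   (* VALUE if present *)
            val lookupOrAdd : vexp -> value     (* creates an entry when absent *)
            val add : vexp -> value             (* fresh entry knowing only itself *)
        end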

Fig 8: Statements in SSA

Before going into the working of the algorithm, VanDrunen [1] lays out its framework and the important details the algorithm depends on. The most important of these is the ability to cache and collect all the necessary parts while going through the code being optimized, and to categorize them. For this, seven sets are created for each block in the SSA tree. They are as follows:

1. Exp_Gen – collects the operands and value expressions of the expressions in the block.
2. Tmp_Gen – collects the target variables of statements.
3. Phi_Gen – collects the target variables of phi statements.
4. Avail_Out – the collection of target variables that are available from the current block to its successors.
5. Avail_In – similar to Avail_Out, representing the available value expressions (which are just target variables) flowing into the block from its predecessor.
6. Antic_In – the collection of value expressions that can be anticipated into the block, meaning expressions that could be made available in the block from its available expressions. They mainly represent source expressions, which is helpful for analyzing their presence, retracing their presence or absence in the predecessors, and finding partial redundancies.
7. Antic_Out – similar to Antic_In, representing the expressions anticipated out of the block.

Fig 9c shows the Antic_In and Avail_Out for the example SSA program in Fig 9a. The expression Y1 + 2 is partially redundant because of its availability in block B3. Antic_In, as defined above, anticipates whether an expression can be hoisted as high as possible in the dominator tree given the set of available expressions in a block; but in order not to widen its scope and lengthen all the control paths, hoisting is postponed as much as possible. So, once Antic_In is computed for all blocks, a block may have many expressions in its Antic_In: these are all the expressions that could be hoisted into it but are not yet handled. Avail_Out, in contrast, is limited to the expressions the block


already handles. Thus, if Antic_In and Avail_Out are matched between a merging block and its predecessors, partial redundancy can be identified easily. The Antic_In for the above example converges after 2 iterations. Even though different versions of X > 100 appear in the Antic_In of blocks, this is a way of passing the value to the next iteration in case there is a redundancy based on it. The Avail_Out of each block consists of the expressions it generated and the ones generated by its dominators. This is how the dominator tree is used to scope the expressions available to the blocks.

Fig 9c: Antic_In and Avail_Out

For further clarification on these sets, the reader is referred to the worked example in the thesis [1], which gives a clear visual representation of how and which expressions go into these sets on a small toy example.

Three operations are possible on these sets:

o insert () – inserts a value expression into the list.
o val_insert () – inserts the value expression into the list only if no other entry in the list has the same equivalent VALUE in the hash table.
o val_replace () – replaces the value expression in the list if another entry in the list has the same equivalent VALUE in the hash table.

These operations improve the algorithm by removing the need to compute both Avail_In and Avail_Out, and similarly both Antic_In and Antic_Out. Only Avail_Out and Antic_In need to be computed, simplifying the task to five of the seven sets discussed for each block. A list-based sketch of val_insert follows.
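A minimal Standard ML sketch of val_insert over a list-based set, with valueOf a placeholder standing in for the hash-table lookup; val_replace is analogous, swapping out the matching element instead of skipping the insert:

    (* Insert x only if no existing element shares its VALUE. *)
    fun valInsert valueOf (xs, x) =
        if List.exists (fn y => valueOf y = valueOf x) xs
        then xs
        else x :: xs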

Now that the framework is defined by the creation of the hash table and the per-block storage, the next task is to define how these are populated. Once this information is collected, the next steps are to eliminate the partial redundancies by making them fully redundant and then to remove all the full redundancies.


The above-mentioned process is carried out in four steps:

I. BuildSets Phase_1

This phase populates the global table and the sets corresponding to each block. Each block is visited in a top-down traversal of the dominator tree. Before the statements in a block are processed, its Exp_Gen, Tmp_Gen, and Phi_Gen are initialized to empty sets and its Avail_Out is assigned the Avail_Out of its dominator in the dominator tree. This assignment of Avail_Out from the dominator builds a scope-oriented view of the expressions available to a block.

Each statement is then processed based on its type, and the target variables are added to the global table: if they are functional, their VALUE is looked up in the table; otherwise a new entry is added.

▪ If the statement has a side effect, it is treated as a temporary that is not considered in any form for optimization. Its target variables are collected in Tmp_Gen and later used as an eligibility criterion when eliminating redundancies, by checking that they are not embedded in the expressions being eliminated in the clean step. Part of their purpose is discussed further in the next phase.

▪ If the statement is a phi statement, the target variable is collected in Phi_Gen. This is very important in scenarios where the origin of an expression has to be retraced: the phi variables in it have to be translated to the phi arguments corresponding to the path they came from. This becomes clear in the phi_translate step of the next phase.

▪ If the statement is a variable assignment, the target variable is inserted into the table with the VALUE of the source variable. This way both variables share a common VALUE in the table and can be identified as value equivalent in later stages of the algorithm. The source variable is inserted into Tmp_Gen, along with a val_insert into Exp_Gen. This ensures that only one entry per expression is cached and helps in creating the partially redundant expressions for the block if required; otherwise there would be multiple variables in memory per block holding a common value, and a leader would always need to be chosen to eliminate the non-uniformity that could arise from picking one at random.

▪ If the statement is an expression assignment, the value expression of the source expression is created by translating the variables in the expression to their values via lookup on the table. The statement is then handled like a variable assignment, with the additional step of adding all the operands and the expression as a whole to Exp_Gen with val_insert. Finally, the target variable of every statement is inserted into Avail_Out using val_insert, growing the available-expression knowledge of the block being processed (a sketch of this per-statement handling follows).
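A minimal Standard ML sketch of this per-statement handling over a toy IR; the datatypes and helpers are simplifications invented for illustration (val_insert's duplicate check is omitted, and the sets are modelled as mutable lists as in the implementation):

    datatype exp = EVar of string | EAdd of string * string
    datatype stm = Assign of string * exp      (* functional: x = e *)
                 | Effectful of string         (* x = e with side effects *)

    fun doStatement (expGen, tmpGen, availOut) s =
        case s of
            Assign (x, e) =>
                ( expGen := e :: !expGen       (* cache the source expression *)
                ; availOut := x :: !availOut ) (* x is now available downstream *)
          | Effectful x =>
                ( tmpGen := x :: !tmpGen       (* excluded from optimization *)
                ; availOut := x :: !availOut )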


II. BuildSets Phase_2

Antic_Out[b] =
    { e | e ∈ Antic_In[succ0(b)] ∧ ∀b' ∈ succ(b). ∃e' ∈ Antic_In[b'] with lookup(e) = lookup(e') }   if |succ(b)| > 1
    phi_translate(Antic_In[succ(b)], b, succ(b))                                                     if |succ(b)| = 1

Antic_In[b] = clean(canonEXP(Antic_Out[b] ∪ Exp_Gen[b] − Tmp_Gen[b]))

Equation 1: Antic_In and Antic_Out equations [1]

The above equations are used to find the Antic_In of all blocks in this phase. As the equations are mutually recursive, it is necessary to compute both Antic_In and Antic_Out, even though only one is kept in memory. This phase calculates Antic_In for all blocks by visiting them in post-order traversal of the dominator tree until there is no change. Since the halting condition is convergence, larger programs may take more iterations; VanDrunen [1] proposes stopping after 10 iterations, as most programs converge within that many. It is important to understand Equation 1, what it means, and why it matters for partial redundancy. First, the granular details of the functions used in the equation:

o clean – given a set of value expressions, filters out those that contain side-effect variables, explicitly or implicitly, as operands.
o phi_translate – given a value expression from the successor of a block, translates its operands to the block's phi arguments along that path if they contain a phi variable.
o find_Leader – given a value and a list of value expressions, returns the value expression in the list that matches the value.
o succ – the successors of a block in the control-flow graph.
o ∪, ∩, − – union, intersection, and difference operations on sets.
o canonEXP – translates the expressions by making a consistent choice of variable when more than one with the same value is in memory. This step is omitted here thanks to the use of val_insert.

Going over the equations in more detail, Antic_Out for each block is calculated based on the number of successors the block has. If there is exactly one successor, there is no scope for partial redundancy; in that case the phi_translate operation is applied to the successor's Antic_In, since apart from the variable changes the two are semantically the same. If there is more than one successor, only those expressions common to the Antic_In of all successors are collected. Antic_In is then generated from Antic_Out by a union with Exp_Gen, followed by removal of the temporaries, as they only refer to the target variables


and not the expressions they represent. Finally, a clean is performed, removing all the expressions with side effects embedded in them.
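Phase 2 as a whole repeats these equations until no block's Antic_In changes. A minimal Standard ML sketch of that driver loop, with computeAnticIn a placeholder that recomputes one block's Antic_In and reports whether it changed:

    (* Iterate over the blocks (in post-order) until a full pass changes nothing. *)
    fun fixpoint (blocks, computeAnticIn) =
        let
            fun onePass () =
                List.foldl (fn (b, changed) => computeAnticIn b orelse changed)
                           false blocks
            fun loop () = if onePass () then loop () else ()
        in
            loop ()
        end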

III. Insert With Antic_In available for all blocks, the origin of these expressions can be retraced through the predecessors. If partial redundancies exist in the code, only blocks with more than one predecessor can exhibit them. Each merging block is therefore visited in top-down traversal of the dominator tree until convergence, i.e., until no more insertions occur. Antic_In is converted from a set to a linked list to maintain the topological order of its contents. Every value expression in Antic_In is checked for its presence in the predecessors to determine which predecessor it came from; this tracks the availability of these statements in the predecessors. If a predecessor misses the value expression in its Avail_Out, the expression is partially redundant. In that case a new statement s' is created using the leaders of the operands of the partially redundant statement s. The target variables of s and s' are added as phi arguments, and a new phi statement assigned to a fresh phi variable is created in these blocks. If a new phi statement is created, the phi variable is entered with a val_replace operation into the Avail_Out of the block, so that statements depending on this expression can be inserted and eliminated in the next iteration and in the eliminate phase.

IV. Eliminate

If the eliminate phase is run while skipping the insert phase, it performs the equivalent of Common Subexpression Elimination. Each block is visited in top-down traversal of the dominator tree. For each statement in these blocks, the leader for the target variable's value (from the global table) is searched for in the Avail_Out of the block. If the leader differs from the target variable of the statement being handled, the source expression of the statement is replaced with this leader. In this way redundant expressions are eliminated by directly assigning from the variable that already holds the computationally equivalent value earlier in the SSA.
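A minimal sketch of this replacement step, reusing the toy IR from the Phase 1 sketch above; lookupValue and findLeader are placeholders for the global-table lookup and the Avail_Out leader search:

    (* If x's value has a leader other than x, rewrite x = e into the copy
       x = leader; otherwise leave the statement untouched. *)
    fun eliminateStatement (lookupValue, findLeader, availOut) s =
        case s of
            Assign (x, e) =>
                (case findLeader (lookupValue e, availOut) of
                     SOME leader =>
                         if leader = x then s else Assign (x, EVar leader)
                   | NONE => s)
          | _ => s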


5. Design Values and Expressions The preceding section discussed the implementation of GVN-PRE [1] in Jikes RVM [4] for Java. There are various differences in the kinds of expressions and control flow allowed in Java and Standard ML, so a few changes were made throughout the proposed algorithm in order to implement this pass in MLton; they are discussed in the following sections. The SSA expressions handled by GVN-PRE are of the form var | var op var. These are translated to value expressions of the form t | v op v, where the variables in composite expressions are replaced with their corresponding values. To represent MLton's SSA expressions similarly, it is important to understand them first. The types of expressions in MLton's SSA are:

structure Exp =
    struct
        datatype t =
            ConApp of {args: Var.t vector,      (* constructors *)
                       con: Con.t}
          | Const of Const.t                    (* constants *)
          | PrimApp of {args: Var.t vector,     (* primitive applications: +, -, *, / *)
                        prim: Type.t Prim.t,
                        targs: Type.t vector}
          | Profile of ProfileExp.t             (* expressions used for debugging *)
          | Select of {offset: int,             (* get a component from a tuple *)
                       tuple: Var.t}
          | Tuple of Var.t vector               (* tuples *)
          | Var of Var.t                        (* variable references *)
    end

Source mlton/mlton/ssa/ssa-tree.sig

Though there are various new expressions to handle, all of them can be translated to their value expressions in the same way. The translation of the above expressions into GVN-PRE [1] values and value expressions is done as represented in the structure below. A structure VExp is created to represent values (VExp.value) and value expressions (VExp.t). VExp.value is composed of three parts: vexps, a list of all the value expressions holding the same value; id, an integer used to compare two values for equality; and vType, the type of the expression. Except for Exp.Profile, all expressions are considered for optimization, and their VExp equivalents are translations of their inner variables into VExp.value. A few helper functions are added as members of the structure: layout, to define how the datatype is printed; equals, to compare two objects of the datatype; and hash, to hash the datatype.


structure VExp =
    struct
        datatype t =
            VConst of Const.t
          | VPrimApp of {prim: Type.t Prim.t,
                         args: value vector,
                         targs: Type.t vector}
          | VConApp of {args: value vector, con: Con.t}
          | VSelect of {offset: int, tuple: value}
          | VTuple of value vector
          | VVar of Var.t
        and value = Value of {vexps: t list ref,
                              id: int,
                              vType: Type.t}
    end

VExp structure representing the values and expressions

Benefits of the above representations:

1. Representing values and expressions with a new datatype not only modularizes the development but also removes the convoluted business of distinguishing between SSA expressions, value expressions, and values.

2. With Exp and VExp represented similarly, translating between them and handling them with pattern matching is simple and straightforward.

3. Caching the type of the variable along with its value means the type can be used directly when creating new statements.

4. A new id, generated by a simple counter starting from 0, is assigned to each value created with the newValue function. This id reduces the complexity of comparing two values to O(1) with a simple integer equality check, which would otherwise require comparing two lists of value expressions. The id being unique also allows hashing a value in constant time.
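A sketch of point 4 in terms of the VExp structure above; newValue is named in the text, but the helper bodies here are illustrative:

    local
        val counter = ref 0
    in
        (* Allocate a fresh, empty value with a unique id. *)
        fun newValue vType =
            let val id = !counter
            in counter := id + 1
             ; Value {vexps = ref [], id = id, vType = vType}
            end
    end

    (* O(1) equality and hashing via the id field. *)
    fun valueEquals (Value {id = id1, ...}, Value {id = id2, ...}) = id1 = id2
    fun valueHash (Value {id, ...}) = Word.fromInt id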

ValTable A global hash table is created using HashSet from one of MLton's utility libraries, as shown below in the structure ValTable. The three functions lookup, lookup_or_add, and add are all implemented as in GVN-PRE [1].

structure ValTable =
    struct
        val table: {hash: word, vexp: VExp.t, values: VExp.value} HashSet.t =
            HashSet.new {hash = #hash}
    end

ValTable Structure to represent the global hash table

The records in the table consist of three main parts: hash, the hash value of the VExp.t; vexp, the VExp.t being represented, serving as the KEY of the hash table; and values, the VExp.value serving as


the VALUE of the hash table. Many expressions are commutative, and the order of their operands does not change their equivalence. To handle such expressions, the operands are sorted in ascending order of their value ids before the hash operation is performed on them. This does not change the code being optimized; it merely gives a uniform representation to expressions that occur non-uniformly in the code, so that they can be identified as equivalent.

A minor difference from the original table in GVN-PRE [1] worth noting is that the VALUE here is represented with a list data structure, whereas the proposed algorithm uses a hashSet. Although this may increase the time complexity of retrieval and containment checks, the main idea was to keep the table simple rather than nesting a hashSet inside it. Even with a list, the performance of lookup_or_add is not affected, since a push is performed rather than an append.
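For illustration only, here is a standalone Standard ML model of the three table operations, with the value expression modelled as a string and the VALUE as a shared mutable list; the real pass uses MLton's HashSet as shown above:

    (* KEY = value expression, VALUE = shared list of equivalent vexps. *)
    val table : (string * string list ref) list ref = ref []

    fun lookup vexp =
        case List.find (fn (k, _) => k = vexp) (!table) of
            SOME (_, v) => SOME v
          | NONE => NONE

    fun add vexp =
        let val v = ref [vexp]              (* new VALUE knowing only itself *)
        in table := (vexp, v) :: !table     (* push, not append *)
         ; v
        end

    fun lookupOrAdd vexp =
        case lookup vexp of
            SOME v => v
          | NONE => add vexp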

LabelInfo Each block has a record of sets storing various information, as discussed in the related work: Tmp_Gen, Avail_Out, Antic_In, and others. Instead of hashing each block and creating separate hashSets, all the necessary items are packed together in a single structure called LabelInfo. The major change with this approach is that all the sets are represented as lists. One advantage of this representation is that the set-to-list conversion involved in BuildSets phase 2 and insert (for topological ordering) can be avoided, as all elements are stored in the desired order by default. Some time complexity is sacrificed: functions like val_insert and val_replace go from O(1) to O(n), where n is the number of elements in the list at the time, though insert remains constant time using a push operation. Apart from the five sets discussed before, a few more are added to avoid recomputation at later stages of the algorithm. Before the algorithm runs, the code to be optimized is scanned once and the dominator tree is generated from its control-flow graph; at this step the dominator of each block is saved, along with predecessor and successor information from the transfers. Another important addition to the structure is killGen, a separate list collecting the targets of side-effect statements, discussed in more detail in the implementation section. As the insert phase runs until convergence, the SSA would otherwise have to be rebuilt incrementally, possibly rebuilding a block n times if n new statements are added to it over a complete run. This is optimized by storing the information about newly added statements in this structure alongside the other lists and building the complete code once at the end. For this purpose three new lists are used: statementsGen caches the statements that need to be added to the blocks causing partial redundancy; gotoArgsGen caches the variables that need to be passed as results to the successor block from the statements being added; and finally


argsGen holds the variables that need to be added to the successor block to catch the values passed from the newly created goto statements. Goto transfers are the way phi statements are mimicked in MLton's SSA.

structure LabelInfo =
    struct
        datatype t = T of {dom: Label.t option ref,
                           predecessor: Block.t list ref,
                           successor: Block.t list ref,
                           tmpGen: Var.t list ref,
                           killGen: Var.t list ref,
                           expGen: VExp.t list ref,
                           phiGen: Var.t list ref,
                           availOut: VExp.t list ref,
                           newSets: VExp.t list ref,
                           availIn: VExp.t list ref,
                           statementsGen: Statement.t list ref,
                           argsGen: (Var.t * Type.t) list ref,
                           gotoArgsGen: Var.t list ref,
                           anticIn: VExp.t list ref}
    end

LabelInfo structure representing the block-related lists

Each block in the SSA has a unique label assigned to it. A LabelInfo object can be attached as a property of these labels by creating a new object and storing the block-related information there. A block's LabelInfo object can be retrieved by passing the label to the getLabelInfo method, which is set up as a getter function. The code below shows how to add objects to a label's property list in MLton and how to access and modify them.

val {get = getLabelInfo, ...} =
    Property.get (Label.plist, Property.initFun (fn _ => LabelInfo.new ()))
val labelInfoObj = getLabelInfo labelName

Create new label properties and access them

The new function assigns all the structure members brand-new empty lists, and a separate function make is used to access these members. This getter can return either a reference to or a copy of these lists, which would otherwise become cumbersome later.


6. Implementation The implementation of PRE in MLton is based entirely on GVN-PRE [1]. This section covers the implementation differences between the proposed approach and this project. The major changes stem from the differences between the two SSAs, which required minor tweaks, and from the scope for extending the same approach to other expressions in MLton's SSA. For a few functions, although their functionality and purpose were specified in the algorithm, their implementation was open to interpretation; these were fine-tuned as needed for GVN-PRE as a whole to work accurately. These interpretations, in turn, had to be handled differently in a few portions of the insert phase, which was addressed with a simple fix. The implementation begins with integrating GVN-PRE just before CSE, with the intention of letting CSE clean up any full redundancies not caught by GVN-PRE. CSE, along with removing global redundancies, also performs a few extra optimizations, canonicalizing expressions by ordering their operands according to their order in the registers, and it also performs optimizations related to Arith transfers effectively. Keeping all these factors in mind, rather than removing CSE from MLton when adding this new pass, it was kept as a backup. As mentioned before, the SSA structure differs somewhat between VanDrunen's [1] and MLton's. There are many control transfers in MLton's SSA, allowing multiple inflows and outflows per block. The types of transfers in MLton are as follows:

structure Transfer:
    sig
        datatype t =
            Arith of {args: Var.t vector,       (* arithmetic transfer *)
                      overflow: Label.t,
                      prim: Type.t Prim.t,
                      success: Label.t,
                      ty: Type.t}
          | Bug                                 (* error *)
          | Call of {args: Var.t vector,        (* function calls *)
                     func: Func.t,
                     return: Return.t}
          | Case of {cases: Cases.t,            (* case control *)
                     default: Label.t option,
                     test: Var.t}
          | Goto of {args: Var.t vector,        (* goto block transfer *)
                     dst: Label.t}
          | Raise of Var.t vector               (* raise exception *)
          | Return of Var.t vector              (* return result *)
          | Runtime of {args: Var.t vector,     (* runtime exception *)
                        prim: Type.t Prim.t,
                        return: Label.t}
    end

Source mlton/mlton/ssa/ssa-tree.sig

The goto transfer is MLton's way of handling phis in its SSA tree structure. Instead of merging the versions of a variable from different branches and assigning the result to a new one, these


variables are passed as results of a block to its successor, and the successor block receives these results as block arguments, as shown in Fig 10.

    With phi statements:             With goto transfers:

    B1: A_1 = 1                      B1: A_1 = 1
    B2: A_2 = A_1 + 1                B2: A_2 = A_1 + 1
                                         Goto B4 (A_2)
    B3: A_3 = A_1 + 4                B3: A_3 = A_1 + 4
                                         Goto B4 (A_3)
    B4: A_4 = Φ (A_2, A_3)           B4 (A_4): ...

Fig 10: code snippet without and with goto transfers, from left to right

With all these transfers providing inflows and outflows to blocks, it becomes difficult to add statements to blocks and pass results to the successor block, especially for case transfers. This is handled by creating landing pads for such blocks by breaking the critical edges. For this purpose, a breakCriticalEdges pass is introduced before the GVN-PRE pass in the optimization phase, as shown in Fig 11. All unused critical edges are later cleaned up by the shrink function after the GVN-PRE pass. Shrink is a function that runs after each optimization pass in MLton to perform simple compile-time reductions such as constant folding, copy propagation, and dead-block elimination. This helps GVN-PRE mainly in eliminating statements after the eliminate phase, where statements performing redundant computations are assigned variables already holding the computation: instead of assigning a variable and then using it, all expressions depending on the computation are directly replaced with the source variable.

Fig 11: how critical edges are broken with landing pads [1]



After the break-critical-edges pass, the actual GVN-PRE as proposed is performed. The SSA tree to be optimized consists of two parts: the globals, global variables with program scope, and the functions, which form the code to be optimized. First, each global is loaded into the global table and the LabelInfo lists are computed for them, treating them as a separate global block common to all the functions. With this approach the size of the LabelInfo lists is constrained for all other blocks; otherwise every block would carry the information of these globals, increasing space usage considerably. The one thing that must be handled precisely at later stages of the algorithm is to perform each lookup first in the block's LabelInfo and, if absent, to repeat the lookup in the global block's LabelInfo. Once the globals are loaded into the global table, their role in this optimization pass is complete. Next, all the functions are optimized one after the other, individually, without considering interactions between them. Each function is processed as follows: all the necessary information for the phases of GVN-PRE is collected in a preprocessing phase; then the four phases of the algorithm described in the related work section are run; finally, the optimized code is built and passed as input to the next pass. As part of preprocessing, the dominator tree of the function is generated, from which the dominator of each block is assigned, along with the lists of successors and predecessors of each block from the control-flow graph. While collecting this information, the breadth-first and depth-first orders of the blocks are stored as separate lists, which makes it easy to run the different phases of the algorithm with a simple map operation over these lists. Having set up the complete infrastructure, BuildSets phase 1 is applied to all the blocks in the BFS order of the function's blocks. Each block in a function consists of four components.

val Block.T {args, label, statements, transfer} = block

Going back to the working of BuildSets phase 1 from the related work, statements are handled based on their type. The args of a block are a kind of phi statement and are handled accordingly. The rest of the statements in a block can be of three other types, variable assignment, expression assignment, and side-effect statements, and they are processed exactly as described previously; whether an expression is free of side effects is determined with the isFunctional function. The transfer is the outflow from the block; transfers are not processed in this pass and are left as future work. After BuildSets phase 1 executes, every expression of the SSA has an entry in the global table, and each function block has the following lists populated: Avail_Out, Tmp_Gen, Kill_Gen, Phi_Gen, and Exp_Gen. The Kill_Gen list is filled with the target variables of the side-effect statements; these are later used in BuildSets phase 2 when cleaning from Antic_In the value expressions that have such variables embedded in them.
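A rough sketch of this dispatch over MLton's SSA statements; the handle* helpers are placeholders invented for illustration, and the record shape of Statement.T is abbreviated:

    (* Phase 1 dispatch on statement kind (sketch). *)
    fun doStatement (Statement.T {var, exp, ...}) =
        case exp of
            Exp.Var src =>
                handleCopy (var, src)            (* variable assignment *)
          | Exp.PrimApp {prim, args, ...} =>
                if isFunctional prim             (* side-effect free? *)
                then handleExpression (var, prim, args)
                else handleSideEffect var        (* recorded in Tmp_Gen/Kill_Gen *)
          | _ =>
                handleOther (var, exp)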


The next step is to generate Antic_In from Equation 1 with BuildSets phase 2. Like phase 1, BuildSets phase 2 is run over the function's blocks, in DFS order, with a map operation, repeated until convergence, i.e., until no block's Antic_In changes. The implementation of BuildSets phase 2 varies from the one proposed in GVN-PRE [1] in two ways. First, VanDrunen proposes stopping after 10 iterations; here, instead of bounding the iterations, the computation is allowed to run to convergence. Second, implementation decisions had to be made for clean and phi_translate, as their functionality was not clearly described with an example or a diagram. They were implemented as follows. The clean method takes the Antic_In and Kill_Gen of a block as input and returns a processed Antic_In. Its main purpose is to remove expressions that are not explicitly built from side effects but have them deeply embedded in their operands.

Eg:
    Kill_Gen = [ A ]
    Antic_In = [ VZ + VY, A, VA + VB, VM + VN ]
    VZ = [ VB + VN, VN + VM ]
    VB = [ A, VC + 2 ]
    Processed Antic_In = [ VM + VN ]

Consider the above example, where the variable A is a side-effect expression that must be removed from all the value expressions in Antic_In. Even though VZ + VY is not directly built from A, VZ internally holds the value of VB, and VB holds the value of A. In this way every expression is processed by walking its values and the value expressions those values are composed of until all have been checked thoroughly. If the check finds a Kill_Gen variable, the expression is deleted from Antic_In; otherwise it is left as is. A sketch of this recursive check follows.
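A self-contained Standard ML sketch of the taint check behind clean, over toy types invented for illustration (a value is modelled as the mutable list of value expressions it is known to equal; a real implementation must also guard against cyclic values):

    datatype vexp = VVar of string              (* a variable *)
                  | VAdd of value * value       (* toy binary value expression *)
    withtype value = vexp list ref

    (* Does ve (transitively) mention a killed variable? *)
    fun taintedVexp killed (VVar x) = List.exists (fn k => k = x) killed
      | taintedVexp killed (VAdd (a, b)) =
            taintedValue killed a orelse taintedValue killed b
    and taintedValue killed (v : value) =
            List.exists (taintedVexp killed) (!v)

    (* clean: drop every anticipated expression tainted by a side effect. *)
    fun clean (killed, anticIn) =
        List.filter (not o taintedVexp killed) anticIn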

In the Phi_Translate method, processing does not go to deeper levels as in clean; even though this may impact the insert phase, it can be handled by a small fix. The inputs to Phi_Translate are the Antic_In of the successor, the block itself, and the successor block. All the value expressions in the Antic_In are translated to a new list of value expressions by replacing expressions containing successor block arguments with the corresponding goto arguments of the block.

Eg:
    Self-block goto args = [ X, Y ]
    Successor block args = [ A, B ]
    Successor Antic_In = [ VA + VM, A, VB + VM, VL + VM ]
    VL = [ VA + VB, VZ ]
    Returned Antic_In = [ VX + VM, X, VY + VM, VL + VM ]
    Added to the global table: VX + VM = [ ], VY + VM = [ ]


The above example illustrates how Antic_In is translated with only a shallow translation: the expression VA + VM is translated to VX + VM by replacing VA with VX, and similarly for the others. Even though VL internally holds value expressions with VA as an operand, they are left as they are. To perform this translation, two helper methods translate arguments to goto arguments and argument values to goto-argument values; with these, all elements of Antic_In that hold block arguments at the outer level are translated directly. All new value expressions created during this translation, such as VX + VM, are loaded into the global table as new entries. With the completion of BuildSets phase 2, the Antic_In of every block is ready to be used to insert new statements in the paths causing partial redundancy, making the redundancy full so that the eliminate phase can clean it up completely. The implementations of the insert and eliminate phases are almost the same as proposed, except that rebuilding the SSA tree after each insertion is delayed until all insertions are done, to minimize the cost of building it. This creation of the new SSA tree after the insert phase is called the Build Blocks phase, where the now fully redundant code is generated and passed to the eliminate phase, after which it is passed on as input to the other optimization passes. Another major change in the insert phase is due to the difference between the thesis's SSA tree structure and MLton's: instead of creating new statements in the missing paths and phi statements in the merging blocks, the target variables of the created statements are passed as arguments to the goto transfer of the same block, and the target variable of the phi statement is added as an argument of the merging block, where the results passed from the predecessors' goto transfers are caught to remove the redundancy. The expressions in the Antic_In of the merging blocks considered partially redundant have their values missing along some of the predecessors' control paths. While they are inserted into the predecessor blocks as new statements, their operand variables are generated from that block's own Avail_Out. If, during this phase, the operands' values are not part of the Avail_Out, it may be that the Antic_In is not in topological order of its expressions; phi_translate not being a deep translation can also play a role in this scenario. A simple fix is to check whether the number of operands in the expression being inserted matches its availability count in the block being inserted into. If the counts match, the insertion is independent and proceeds with the steps that follow statement insertion; otherwise the insertion is postponed until its dependent operands are available. This count condition keeps the algorithm from adding arguments to goto transfers and successor arguments when the insertion did not actually occur in that iteration. This can be visualized with the example in Fig 13, where three partially redundant expressions are mutually dependent and topological ordering is required. The non-topological ordering of expressions in Antic_In can arise because the multiple iterations of BuildSets phase 2 are carried out in a block order that is not restricted to avoid this issue.


A better fix would be a function that arranges Antic_In in topological order before running the insert phase. As this issue was found at a later stage of implementation, it is left open for future improvement. Even though a few iterations are wasted on expressions that are not yet valid for insertion, the pass runs until convergence, so the desired output of this phase is not hindered.
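The count condition can be sketched in Standard ML as follows, with findLeader a placeholder for the Avail_Out leader search:

    (* Only insert when every operand value of the expression already has a
       leader in the predecessor's Avail_Out. *)
    fun readyToInsert findLeader (operandValues, availOut) =
        List.all (fn v => Option.isSome (findLeader (v, availOut))) operandValues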

After the insert phase, the SSA tree is generated in the Build Blocks phase, where the previous statements and the new statements are all appended to the blocks, along with additions to the goto arguments and block arguments. Finally, eliminate removes all the redundancies by looking up the global table and the leaders in the Avail_Out of the blocks, assigning the redundant expressions their already-known equivalent variables in scope.

After testing the performance and correctness of the above approach, the same method was extended in all of the phases to handle the additional expressions ConApp, Tuple, Select, and Const. In this way redundancies with large data structures are also eliminated, which can reduce register pressure by a large margin. All these expressions being immutable in Standard ML, they could take part in this redundancy-elimination process without any additional work. In addition, the Array_Length and Vector_Length assignment statements were optimized for the arrays and vectors created by the Array_array and Array_toVector functions, by adding an extra value expression to the global table for the size of these structures; otherwise they would have been skipped, as they are derived from Array_toVector and Array_array, which are side-effect statements. This optimization replicates the one in MLton's commonSubexp pass and was added so that the two passes could be compared without any difference in functionality.

Fig 12: GVN-PRE Flow Diagram

Fig 13: Partial redundancy with topological dependencies

    A = 3 + (Fact 40)        L = 3 + (Fact 40)
    B = A + 10               M = L + 10
    X = B + 5                N = M + 5


7. Analyzing GVN-PRE The example below was created as a practice problem: an example with partial redundancy and the right amount of complexity. It was run manually, generating Avail_Out and Antic_In by working out BuildSets phase 1 and phase 2 by hand. This step helped in understanding the algorithm clearly and implementing it correctly. Example 1

Fig 14a: Un-optimized SSA | Fig 14b: After CSE | Fig 14c: After GVN-PRE

Consider the expression x_589 + x_591 of block L_291 in Fig 14a: the same computation is performed in block L_293. This is the simplest form of partial redundancy, arising from an if-else statement. The example also contains a global redundancy between blocks L_293 and L_295 with the expression x_596 + x_589. After CSE, in Fig 14b, the global redundancy is eliminated by removing the statement computing x_603. After GVN-PRE, in Fig 14c, the global redundancy is eliminated along with the partial redundancy: a new statement is added to block L_291, its result is passed along the goto transfer as x_895, and the passed value is caught by the new argument x_896 of block L_293, allowing the statement computing x_600 in Fig 14a to be eliminated.

Working of GVN-PRE on the above example:

Block L_291
Avail_Out: [x_595, x_594, x_593, x_591, x_592, x_589, x_590, x_587, x_588, x_584, x_585, x_586]
Antic_In: [Word32_add (1166, 1171), Word32_add (1171, 1173), Word32_sub (1174, 1166), global_143, WordU32_lt (145, 1175), Word32_add (1166, 1168), x_591, x_589]


Block L_292
Avail_Out: [x_593, x_589, Word32_add (1166, 1169), x_591, Word32_add (1166, 1168), Word32_add (1171, 1173), Word32_sub (1174, 1166), global_143, WordU32_lt (145, 1175)]
Antic_In: [x_593, x_589, Word32_add (1166, 1169), x_591, Word32_add (1166, 1168), Word32_add (1171, 1173), Word32_sub (1174, 1166), global_143, WordU32_lt (145, 1175)]

Block L_293
Avail_Out: [x_593, x_589, Word32_add (1166, 1169), x_591, Word32_add (1166, 1168), Word32_add (1171, 1173), Word32_sub (1174, 1166), global_143, WordU32_lt (145, 1175)]
Antic_In: [x_593, x_589, Word32_add (1166, 1169), x_591, Word32_add (1166, 1168), Word32_add (1171, 1173), Word32_sub (1174, 1166), global_143, WordU32_lt (145, 1175)]

Portion of Global Table:

x_593 -> 1169
x_589 -> 1166
Word32_add (1166, 1169) -> 1183
Word32_add (1171, 1173) -> 1174
Word32_sub (1174, 1166) -> 1175
x_591 -> 1168
global_143 -> 145
x_592 -> 1167
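The find_Leader operation used in the step below maps a value number back to some in-scope variable carrying that value. A minimal sketch, assuming a (hypothetical) valueOfVar lookup from variables to their value numbers in the global table:

type value = int

(* Return some variable from Avail_Out whose value number is v, if any. *)
fun findLeader (availOut : string list, valueOfVar : string -> value, v : value)
    : string option =
  List.find (fn x => valueOfVar x = v) availOut

With the table portion above, findLeader (availOut, valueOfVar, 1166) would return SOME "x_589", and the value 1168 would map to SOME "x_591".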

As block L_293 is the merging block, all the value expressions in its Antic_In are visited and checked for partial redundancy. The expression Word32_add (1166, 1168) is identified as partially redundant, its value being available through L_292. To insert this expression in block L_291, the variables holding the values 1166 and 1168 are found with the find_Leader operation sketched above; they correspond to the variables x_589 and x_591, which are already available in the block's Avail_Out. A fresh variable x_895 is therefore generated, and a statement holding the expression Word32_add (x_589, x_591) is added to block L_291. The result is passed as an argument from all the predecessors of block L_293, that is, L_291 and L_292: the goto transfer of block L_292 returns (x_595, x_595) and that of block L_291 returns (x_593, x_895), the second element of each transfer holding the redundant expression. To catch this result in block L_293, a fresh variable x_896 is created and added to its formal arguments. The two arguments passed from block L_292 are redundant with each other, but they correspond to the values of different variables being assigned.

Example 2

After implementing GVN-PRE and checking its performance, it was necessary to come up with a good example containing a proper scenario in which GVN-PRE performs better than CSE. The example had to involve a few expressions more complex than ordinary arithmetic operations, so it was built around a partial redundancy with lists: a list of values is appended in a loop after the redundancy, and the value being appended has already been computed through one of the control paths before, as shown in Fig 15.

Fig 15: Example 2 Version 1


(* n is defined elsewhere in the surrounding benchmark program *)
fun loop ls =
    let val [a,b,c] = ls
    in if (a * (~1)) < 0 then [a+b, b+a, c+a] else ls
    end

fun tupEval ls res =
    let val tup = loop ls
    in if (length res) < 3000 then tupEval ls (tup :: res) else res
    end

val first = if n mod 2 = 0 then loop [10,9,20] else [10,9,20]
val ls = tupEval [10, 9, 20] []
val start = foldl (fn (x, acc) => x + acc) 0 first
val res = List.foldl (fn ([a,b,c], z) => a+b+c+z) start ls

Standard ML code for Example 2


As shown in Fig 16a, if the partial redundancy is eliminated, the portion of the list appended at each iteration is based on the previous computation and is shared by reference. If the redundancy is not eliminated, a new list is generated at every iteration, pointing to a different memory location, and register pressure increases because the list to be appended is recomputed from scratch each time.

Two more versions of the same example were prepared: Version 2 makes the computation of X in Fig 16 common to both control transfers of the if-else statement, and Version 3 makes the computation of X globally redundant by making it available through the missing branch as well.

Chart 1: Run-Time Ratio Comparison of GVN-PRE & CSE for Example 2

The results in Chart 1 clearly show GVN-PRE performing better than CSE by a factor of about 3, and not only in the partial-redundancy scenario: the same holds for global redundancy when the statements are available from the leaves of all the control paths.

Example 2 Results (run-time ratio)

            No Optimization   CSE    GVN-PRE   Both CSE & GVN-PRE
Version 1   1                 1      0.37      0.36
Version 2   1                 0.37   0.37      0.36
Version 3   1                 1.01   0.35      0.36


8. Results

Run-Time Ratio

benchmark          MLton0  MLton1  MLton2  MLton3
barnes-hut         1       1.004   1.012   1.008
boyer              1       0.986   0.992   0.992
checksum           1       0.832   0.836   0.83
count-graphs       1       1.026   1.022   1.02
DLXSimulator       1       0.99    0.996   1
even-odd           1       1.002   1       1
fft                1       1       1.004   1
fib                1       1.004   1       1
flat-array         1       1.008   1.002   1
hamlet             1       0.998   0.948   0.958
imp-for            1       0.996   0.996   0.996
knuth-bendix       1       1.022   1.044   1.036
lexgen             1       0.996   1.002   0.984
life               1       0.974   0.984   0.976
logic              1       1.004   1.006   1.01
mandelbrot         1       0.91    0.912   0.912
matrix-multiply    1       0.89    0.892   0.892
md5                1       0.84    0.908   0.822
merge              1       1.002   1.004   1.002
mlyacc             1       0.998   0.982   0.986
model-elimination  1       1.006   1.012   1.008
mpuz               1       0.996   1       0.992
nucleic            1       0.968   0.966   0.972
output1            1       1.08    1.068   1.072
peek               1       0.994   0.984   0.986
psdes-random       1       0.88    0.876   0.878
ratio-regions      1       0.976   0.98    0.97
ray                1       1.008   1.004   0.988
raytrace           1       1.034   1.01    1.038
simple             1       0.898   0.982   0.94
smith-normal-form  1       0.998   1       1
tailfib            1       0.998   0.992   1.002
tak                1       1.006   1.002   1
tensor             1       1.004   1.002   1
tsp                1       0.946   0.946   0.946
tyan               1       0.994   0.994   0.992
vector-concat      1       1.004   0.998   1.002
vector-rev         1       0.98    0.98    0.984
vliw               1       0.942   1.002   0.948
wc-input1          1       1       1       1
wc-scanStream      1       0.976   0.972   0.978
zebra              1       1.01    1.014   1.01
zern               1       0.97    0.972   0.972
example            1       0.996   0.36    0.364

Table 1: Run-Time Ratio of benchmark tests


Run Time

benchmark          MLton0  MLton1  MLton2  MLton3
barnes-hut         27.742  27.816  28.058  27.928
boyer              49.202  48.444  48.812  48.834
checksum           33.426  27.862  27.884  27.864
count-graphs       39.946  41.038  40.868  40.718
DLXSimulator       33.282  32.994  33.086  33.238
even-odd           39.042  39.124  39.074  39.098
fft                31.096  31.098  31.204  31.112
fib                14.852  14.908  14.866  14.862
flat-array         22.696  22.896  22.75   22.72
hamlet             33.25   33.194  31.524  31.828
imp-for            24.972  24.86   24.832  24.872
knuth-bendix       34.406  35.136  35.848  35.612
lexgen             33.29   33.042  33.298  32.642
life               39.29   38.346  38.636  38.33
logic              32.77   32.946  32.952  33.112
mandelbrot         39.22   35.8    35.844  35.822
matrix-multiply    32.236  28.72   28.74   28.74
md5                34.938  29.354  31.664  28.782
merge              34.044  34.086  34.182  34.132
mlyacc             33.066  32.964  32.402  32.4
model-elimination  36.282  36.45   36.62   36.532
mpuz               30.058  29.834  30.07   29.83
nucleic            32.976  31.986  31.912  32.098
output1            28.982  31.316  30.966  31.102
peek               34.096  33.926  33.584  33.646
psdes-random       39.572  34.85   34.636  34.718
ratio-regions      47.244  46.194  46.126  45.836
ray                38.944  39.112  38.76   38.112
raytrace           34.178  35.262  34.492  35.434
simple             31.69   28.492  31.172  29.824
smith-normal-form  36.606  36.538  36.6    36.588
tailfib            36.812  36.766  36.462  36.864
tak                29.878  30.048  29.95   29.852
tensor             39.308  39.506  39.302  39.244
tsp                36.814  34.804  34.89   34.856
tyan               30.54   30.344  30.352  30.384
vector-concat      35.574  35.68   35.58   35.63
vector-rev         28.982  28.434  28.356  28.55
vliw               29.426  27.654  29.468  27.846
wc-input1          42.606  42.588  42.616  42.612
wc-scanStream      21.606  21.062  20.984  21.142
zebra              30.356  30.566  30.81   30.67
zern               29.104  28.262  28.27   28.322
example            15.738  15.658  5.7     5.715

Table 2: Run-Time of benchmark tests


Binary Size

benchmark          MLton0     MLton1     MLton2     MLton3
barnes-hut         178,191    178,479    178,607    178,591
boyer              249,229    249,165    249,229    249,165
checksum           117,069    116,845    116,845    116,845
count-graphs       145,533    144,541    144,685    144,685
DLXSimulator       214,136    213,240    214,744    214,200
even-odd           116,765    116,765    116,765    116,765
fft                150,119    144,119    146,247    144,119
fib                116,717    116,717    116,717    116,717
flat-array         116,493    116,493    116,493    116,493
hamlet             1,450,848  1,449,872  1,457,840  1,457,536
imp-for            116,525    116,525    116,525    116,525
knuth-bendix       189,672    189,736    190,088    189,992
lexgen             299,207    297,303    298,119    298,135
life               141,917    140,445    141,229    140,477
logic              197,581    197,581    197,581    197,581
mandelbrot         116,573    116,573    116,573    116,573
matrix-multiply    118,989    118,861    118,861    118,861
md5                147,928    147,608    147,816    147,640
merge              118,173    118,173    118,173    118,173
mlyacc             659,431    661,799    665,127    665,127
model-elimination  818,966    819,958    820,966    820,470
mpuz               122,493    122,525    122,525    122,525
nucleic            302,133    300,213    300,213    300,213
output1            154,364    154,444    154,668    154,668
peek               151,928    151,960    152,264    152,264
psdes-random       120,861    120,909    120,909    120,909
ratio-regions      146,253    143,725    144,685    144,221
ray                253,750    252,374    252,726    252,710
raytrace           384,504    381,656    382,424    382,392
simple             368,217    350,889    369,225    358,025
smith-normal-form  317,285    315,445    318,245    316,789
tailfib            116,573    116,573    116,573    116,573
tak                116,701    116,701    116,701    116,701
tensor             185,204    182,452    182,740    182,740
tsp                160,240    160,192    160,224    160,224
tyan               227,400    227,624    228,488    228,472
vector-concat      118,317    118,317    118,317    118,317
vector-rev         118,749    118,109    118,749    118,109
vliw               521,505    518,929    518,193    516,209
wc-input1          180,963    180,227    180,547    180,579
wc-scanStream      190,659    189,987    190,179    190,307
zebra              227,800    227,704    227,736    227,736
zern               156,201    155,401    155,497    155,497
example            118,829    118,829    118,653    118,653

Table 3: Binary Size of benchmark tests


Compile Time

benchmark          MLton0  MLton1  MLton2  MLton3
barnes-hut         2.67    2.46    2.71    2.75
boyer              2.66    2.69    4.81    4.88
checksum           2.08    2.24    2.05    1.98
count-graphs       2.24    2.22    2.3     2.38
DLXSimulator       2.64    2.66    3.95    4.01
even-odd           2.06    2.08    2.02    2.05
fft                2.19    2.18    2.19    2.1
fib                2.03    2.08    2.18    1.96
flat-array         2.12    2.1     2.04    1.98
hamlet             11      12.03   112.13  113.4
imp-for            2.06    2.07    2.06    2
knuth-bendix       2.42    2.46    2.62    2.74
lexgen             3.08    3.11    4.84    4.77
life               2.21    2.2     2.16    2.1
logic              2.45    2.52    2.74    2.78
mandelbrot         2.07    2.11    2.11    1.97
matrix-multiply    2.13    2.12    2.09    2.02
md5                2.24    2.26    2.44    2.46
merge              2.08    2.06    2.05    2.01
mlyacc             6.28    6.14    16.23   17.08
model-elimination  6.18    6.4     20.38   19.68
mpuz               2       2.11    2.1     2.04
nucleic            3.44    3.46    10.67   10.81
output1            2.25    2.28    2.52    2.41
peek               2.24    2.27    2.58    2.5
psdes-random       2.08    2.13    2.04    2.02
ratio-regions      2.4     2.36    2.38    2.41
ray                2.88    2.81    3.28    3.44
raytrace           4.04    4.02    5.85    6.34
simple             3.23    3.18    3.79    3.68
smith-normal-form  2.98    3.09    16.11   16.3
tailfib            2.09    2.09    2.06    1.85
tak                2.07    2.09    2.05    1.95
tensor             2.58    2.58    2.91    2.97
tsp                2.3     2.34    2.5     2.6
tyan               2.8     2.78    3.17    3.17
vector-concat      2.1     2.07    2.1     1.98
vector-rev         2.09    2.08    2.01    1.95
vliw               5.3     4.98    6.55    6.66
wc-input1          2.42    2.4     2.68    2.67
wc-scanStream      2.48    2.46    2.84    2.78
zebra              2.72    2.8     3.17    3
zern               2.28    2.25    2.22    2.23
example            2.04    2.08    2.07    1.98

Table 4: Compile-Time of benchmark tests


MLton has a benchmark test suite with 43 tests (44 programs here, counting the example program discussed above). Three features are measured for every test: the run time of the compiled program, the compile time, and the size of the generated binary. The tests were run with four versions of the compiler, which correspond, in order, to the columns MLton0 through MLton3 in the tables above:

~/workspace/mlton/build/bin/mlton -drop-pass 'commonSubexp' -drop-pass 'gvnPre' ------ No Optimization
~/workspace/mlton/build/bin/mlton -drop-pass 'gvnPre' ------ CSE
~/workspace/mlton/build/bin/mlton -drop-pass 'commonSubexp' ------ GVN-PRE
~/workspace/mlton/build/bin/mlton ------ Both CSE & GVN-PRE

Tables 1 and 2 show the run-time ratios and the actual run times. The results make it clear that 24 out of the 44 programs speed up with GVN-PRE alone or with both GVN-PRE and CSE, which raises the question of how CSE performs better on the remaining 20. The results also show that for 13 programs the optimization actually worsens performance. This can happen for several reasons related to increased register pressure from widening the scope of variables: eliminating cheap operations like add or sub while increasing register usage by a greater margin than before can leave the resulting optimized program slower.

14 programs actually perform better with CSE alone. Analyzing these odd results opened up a whole avenue for improving GVN-PRE, and the main cause turned out to be CSE's canonicalization of the operands in expressions. CSE restructures expressions based on the operands of recent register loads; this saves the time to load operands from memory and improves the run time of the programs, and previous work based on this behavior has shown up to 10% improvement in run times. A canonicalization function was added to GVN-PRE as an experiment, but it orders operands by their hash values, and the resulting order differs from CSE's even though the two are supposed to agree; one possible cause is that the statements and arguments GVN-PRE adds to the code affect the hash results in some way. A final experiment that could confirm canonicalization as the only reason for CSE's better performance on the remaining programs is to remove the canonicalization from CSE and rerun the benchmarks.

Table 3 shows the size of the binaries generated with all four versions of the compiler; almost all programs show a slight increase in size compared to CSE. Looked at granularly, the increase stays within about 1 MB even for a huge program like hamlet, and on average the binaries grow by 20-300 KB. This behavior is expected and follows from the fact that new statements are added to blocks, which can add more code than is eliminated: if there are four control paths and the partial redundancy comes from one path, then removing the code in the merging block requires inserting code three times, once in each missing path. One way to handle this drawback would be to merge all the missing paths into a new landing pad and add the code once to that empty block.
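As an illustration of the experimental canonicalization added to GVN-PRE, ordering the two operands of a commutative primitive by a hash of the operand; hashVar is an assumed hashing function, and the real code differs in detail.

(* Canonicalize a commutative application such as Word32_add so that the
   two operand orders of equivalent expressions coincide. *)
fun canonicalize (hashVar : string -> word) (oper : string, x : string, y : string) =
  if hashVar x <= hashVar y then (oper, x, y) else (oper, y, x)

CSE instead favors the operand of the most recent register load, which is one reason the two orderings can disagree.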

Compile time, shown in Table 4, increases significantly with the addition of GVN-PRE. This can be improved by limiting the open-ended convergence loops in insert and in BuildSets phase 2 to a fixed number of iterations, such as the 10 proposed by VanDrunen [1]; a sketch of such a cap is given at the end of this discussion. The project was implemented incrementally, and the compile time grew up to 10-fold after Const expressions were added to the optimization: constants are so common that they swell all the lists such as Antic_In and Avail_Out, and each iteration then spends a tremendous amount of extra work pushing these constants as high as possible up the dominator tree while converging.

One last thing to note is that the data structures around LabelInfo need a drastic change: lists have a few benefits but are far slower than hash sets, and changes to the hash and layout functions of VExp could likewise speed up the frequent calls made to them. On inspection, the benchmark tests themselves contain very few partial redundancies, so results based on them do not give GVN-PRE the upper hand it needs to excel. Still, the benchmark results for the example and hamlet programs are promising and encourage extending GVN-PRE's performance with further changes to the existing infrastructure, trying all the solutions proposed so far along with LEPRE; a wide range of improvements can still be applied to the current implementation. Finally, the results reported for GVN-PRE [1] are entirely different from this project's, mainly because VanDrunen compares GVN-PRE against GCSE, whereas MLton's CSE pass is, as mentioned before, a Global Value Numbering pass; GVN is itself very powerful compared to plain CSE, so the comparisons between optimization passes in VanDrunen's [1] work measure something quite different from this project's experiments.
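The iteration cap suggested above is straightforward to express. In the sketch below, step is an assumed function that performs one convergence pass of insert or of BuildSets phase 2 and reports whether anything changed.

(* Run step () until it reports no change or the bound is reached. *)
fun iterateBounded (bound : int) (step : unit -> bool) : unit =
  let
    fun go i = if i < bound andalso step () then go (i + 1) else ()
  in
    go 0
  end

A call such as iterateBounded 10 insertPass (with insertPass a hypothetical driver for one insert iteration) would bound the insert phase to the ten iterations VanDrunen suggests.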

Chart 2: Run Time Ratio (series: No Optimization, CSE, GVN-PRE, Both CSE & GVN-PRE)


Chart 3: Binary Size Ratio (series: No Optimization, CSE, GVN-PRE, Both CSE & GVN-PRE)

Chart 4: Compile Time Ratio (series: No Optimization, CSE, GVN-PRE, Both CSE & GVN-PRE)


9. Challenges

There were many challenges throughout the implementation of this optimization pass in MLton. Creating an example containing partial redundancy with the right amount of complexity was itself time-consuming, and manually running BuildSets phases 1 and 2, generating all the per-block sets over multiple iterations until convergence, took far longer; the material generated during this stage became difficult to manage. It later turned out that the example did not actually contain a partial redundancy, so a new example had to be created and the manual derivation repeated.

Even though the thesis [1] describes most of the algorithm clearly, it does not elaborate the complete details of the helper functions used in a few of its stages. This left a lot of openness throughout the implementation, and whenever an error arose there was always the question of whether it was caused by a decision made to fill one of those gaps. Such doubts were cleared only after a couple of weeks, once the working of GVN-PRE was well understood.

MLton's build is designed so that, after the compiler is compiled, MLton compiles itself with the freshly built compiler along with all the libraries that depend on it. When an error arises during this self-compilation phase, it becomes really difficult to find the root cause by going through the huge SSA files and diagnostic files generated, and even when one error was found, fixing it often exposed the next, going around in circles. Most of these bugs were fixed by analyzing the code added between the failing version and the previous version that compiled successfully. As this went on, compile time slowly grew as much as 10-fold, causing long stretches of blocked time in which no productive work could be done, and at times the original problem being debugged would fade from view.

One lesson learned was to spend as much time as possible up front building modular and efficient infrastructure for the project, because past a certain stage it was no longer viable to take a break and restructure the code to be more productive. Targeting "get things working first" helped a bit at the beginning, but the initial reluctance to update the infrastructure later kept magnifying the problems faced during build failures. After GVN-PRE was completed, the benchmark results favored CSE, and many differences between the code optimized by CSE and by GVN-PRE had to be verified manually; each identified mismatch brought the problems above right back during its fix. It was later decided to create a fresh example for comparing the optimization passes, since most of the benchmark tests were noticed to contain no partial redundancy; this was an interesting challenge in itself, as most of the examples created were not complex enough to run the tests with. Every challenge taught a new lesson, and care was taken not to repeat the same mistake twice.


10. Conclusion

The integration of the GVN-PRE pass in MLton is the first step towards optimizing programs in MLton from the aspect of partial redundancy elimination. The implementation is straightforward, as VanDrunen [1] notes, compared to previous approaches such as SSAPRE [2], and implementing this pass is especially beneficial for Standard ML because ConApp and Tuple expressions are immutable. The results from the benchmark suite are surprising, with CSE at times performing better than GVN-PRE; many factors contribute to this, and they can be improved by addressing the items covered in the future scope. The overall performance degradation traces back to infrastructural choices made throughout the project and to the intention of keeping the implementation simple; the increase in compile time is one component directly affected by this. The results from Example 2 in the Analyzing GVN-PRE section show the real potential of the pass: achieving a 3-fold better run time with GVN-PRE is very promising and motivates continuing the work. With further investigation, once the exact reasons for the run-time regressions are identified, addressing them and tweaking the algorithm accordingly should be straightforward. Since the testing and fine-tuning of this pass was based entirely on two manually created examples, there are endless possibilities for improving the results in untested scenarios; a better way to tackle this drawback is to generate a test suite of examples specifically containing partial redundancies.

11. Future Scope

GVN-PRE can be extended with many optimizations that build on the eliminations it performs. For example, if a code block contains nested case expressions scrutinizing the same expression, the second case operation can be eliminated entirely, since its value is already decided along the path from the first, and the subsequent arms can be optimized by eliminating redundant computations and unreachable code; this becomes a known-case optimization problem (illustrated in the sketch below). As mentioned in the results section, when more than one predecessor block needs insertions, the new statements can be handled effectively by introducing a common landing pad and carrying out the insertions there once; this fix should significantly reduce the growth in binary size. The improvements in VanDrunen's [1] work on decreasing register pressure with LEPRE can be implemented as direct incremental work on top of GVN-PRE, and can be used to eliminate redundancies based on mutable complex data structures in Standard ML. CSE's results depend greatly on canonicalizing the operands, which improves a program's performance by avoiding unnecessary loads through effective operand arrangement; when incorporated into GVN-PRE, a few of the canonicalizations were performed incorrectly because the hash values of the operands were used to order them, and with proper analysis this issue can easily be fixed. Finally, the internal use of lists in the LabelInfo and ValTable structures needs the mandatory change to hash sets, to be tested and tuned for a significant improvement in the time complexity of MLton's GVN-PRE.
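A small source-level illustration of the known-case idea (a hypothetical Standard ML fragment; the pass itself would operate on SSA, not on source):

datatype shape = Circle | Square

(* The inner case scrutinizes the same value x along a path where its
   constructor is already known to be Circle, so the inner case -- and its
   unreachable Square arm -- can be eliminated outright. *)
fun area (x : shape) : int =
  case x of
      Circle => (case x of Circle => 1 | Square => 2)
    | Square => 3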


12. References

[1] Thomas John VanDrunen. Partial Redundancy Elimination for Global Value Numbering. PhD dissertation, Purdue University, 2004. http://cs.wheaton.edu/~tvandrun/writings/thesis.pdf

[2] Fred Chow, Sun Chan, Robert Kennedy, Shin-Ming Liu, Raymond Lo, and Peng Tu. A new algorithm for partial redundancy elimination based on SSA form. In Proceedings of the Conference on Programming Language Design and Implementation, pages 273-286, Las Vegas, Nevada, June 1997.

[3] MLton. http://www.mlton.org/

[4] IBM Research. The Jikes Research Virtual Machine. http://www124.ibm.com/developerworks/oss/jikesrvm/