Differential Register Allocation

If we are going to use more registers, we will need more bits to encode them in the instructions. We can usually expose only 32 (or, rarely, 64) registers.

Register Encoding Dilemma

The register field size limits the number of exposed registers, and renaming registers is not desirable in the embedded world. We will use a technique called Differential Register Allocation, which exposes more registers to the compiler.

The register field constitutes a big portion of the total code size. On ARM, about 25% of the total code size is due to the register field. Even in high-performance processors like Alpha, 28% is dedicated to the register field.

Renaming → map the renamed registers onto the architected registers. Renaming costs power and hardware, and adds complexity.

In the embedded world the ISA imposes a bottleneck → the operand fields limit the size of the register field, so the number of exposed registers is limited. Since in embedded systems the number of exposed registers is the number of total registers, this is a serious limitation.

Thumb: 16 registers, 8 exposed.
SC110: 16 registers; the high 8 or the low 8 are exposed at a time.

Register Encoding Quiz: For high register pressure regions, more registers will decrease the number of memory spills. More exposed registers will improve compiler optimizations.

Encode Differences

Instead of putting in the register number, put in the difference. This will expose more registers.

For example: In this example, 8 registers require 3 bits to encode the names (0 to 7, i.e. 000 to 111). If we encode the differences, we can reduce the number of bits to 2. The difference is used to calculate the new register number from the previous register number and the difference. There is a problem though: what happens when we go from R7 to R1? We get -6 as a difference. To encode it we would need to increase the number of bits, so this won't work.

Modulo Differences

We can use modulo differences to avoid negative values, which would cost an extra bit. Use modified modulo arithmetic: we want to go from Rm to Rn, and RegN is the total number of registers. To account for negative numbers we need a separate case for n < m:

diff = n - m           if n >= m
diff = n - m + RegN    if n < m

This makes the difference always non-negative. Using the modulo method, redo the previous example.

To go from R1 to R4: 4 - 1 = 3, so we need two bits. To go from R7 to R1, the previous example led to -6. But with the modulo method we get 2 (1 - 7 + 8 = 2), so we still need just 2 bits. The modulo difference is really just the number of hops from one register number to the other in a clockwise direction.
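The modulo difference can be written down in a few lines. The following is a minimal Python sketch, not part of the original notes; the function name `modulo_difference` and its parameter names are assumptions chosen for illustration.

```python
def modulo_difference(prev_reg: int, next_reg: int, reg_n: int) -> int:
    """Clockwise hop count from prev_reg to next_reg among reg_n registers."""
    if not (0 <= prev_reg < reg_n and 0 <= next_reg < reg_n):
        raise ValueError("register number out of range")
    diff = next_reg - prev_reg
    if diff < 0:
        # The n < m case: add RegN so the result is never negative.
        diff += reg_n
    return diff


print(modulo_difference(1, 4, 8))  # R1 to R4: 3
print(modulo_difference(7, 1, 8))  # R7 to R1: 2 instead of -6
```

Both calls reproduce the numbers worked out above for 8 registers.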

Differential Encoding

Encode the modulo difference between the current register number and the previous register number. This eliminates the issue of negative numbers. It means we can use fewer bits (or, put another way, with a given number of bits in the ISA we can encode more registers). For example, with 2 bits we can encode only 4 registers under the direct naming scheme. Under the modulo scheme we can encode 8 registers.

Differential Encoding Justifications

Goal: save encoding space.
Method: use fewer bits to encode differences than are needed to encode the total number of registers.

Worst Case Differences Are two bits always sufficient to encode 8 registers? What about this case?

Going from register 0 to register 4 gives a difference of 4. This requires 3 bits, for both direct and difference encoding.

There will always be certain differences that need more bits, so we introduce a new instruction: set_last_reg(register number). In the above example we set the previous register number to 4, which gives 4 - 4 = 0. This can be represented in 2 bits. There is a cost for set_last_reg(), but it is cheap compared to using another bit for register names. So the rule now is: if we can encode an access with the modulo difference, we use it. If we can't, we insert set_last_reg(), and the difference is then 0.
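The rule above can be sketched as a small encoder over a straight-line sequence of register accesses. This is an illustrative Python sketch under stated assumptions: the function name `encode_accesses`, the tuple-based output format, and the initial `last_reg` of 0 are all hypothetical and do not come from the notes.

```python
from typing import List, Tuple


def encode_accesses(regs: List[int], reg_n: int, diff_bits: int,
                    last_reg: int = 0) -> List[Tuple[str, int]]:
    """Encode register accesses as modulo differences.

    Emits ("diff", d) for each access, preceded by ("set_last_reg", r)
    whenever the difference does not fit in diff_bits bits.
    """
    diff_n = 1 << diff_bits  # number of encodable differences
    out: List[Tuple[str, int]] = []
    for reg in regs:
        diff = (reg - last_reg) % reg_n
        if diff >= diff_n:
            # Difference too large: reset the reference point, then encode 0.
            out.append(("set_last_reg", reg))
            diff = 0
        out.append(("diff", diff))
        last_reg = reg
    return out


# 8 registers, 2-bit field: R0 -> R4 needs set_last_reg, the rest fit.
print(encode_accesses([4, 7, 1], reg_n=8, diff_bits=2))
# [('set_last_reg', 4), ('diff', 0), ('diff', 3), ('diff', 2)]
```

Only the first access (a difference of 4) needs the extra instruction. The later hops of 3 and 2 fit in the 2-bit field.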

Multiple Path Inconsistency

What happens when we use this in a program? The first issue is multiple path inconsistency: multiple paths to a join point can lead to different differences.

In this example, the two branches join at an instruction that uses R2, but one path last uses R1 and the other last uses R2. This results in different previous registers when calculating the register difference: one path will have a difference of 1 and the other a difference of 0. To solve this problem we can do the following: on the most frequently taken branch we use the modulo difference encoding, and on the other branch we insert a set_last_reg() instruction stating the last register used.

Hardware Implementation for Decoding Decoding is performed at runtime to restore register numbers from differences in the register fields. Add difference to the last accessed register to get the current one. Example: We have 8 registers R0 to R7

In this example, the last register used is 3. The instruction says the mod difference is 1. Therefore the decoded register to be used is 4.

Now we have an instruction that says the mod difference is 3, the last reg used is 4, so the decoded register is 7.

In this example, the last register used is 7 and the mod difference is 2, so the decoded register is 1. (Remember that 7 + 2 = 9, but we only have registers 0 to 7, so when we go around we pass register 0 and arrive at register 1.)

Hardware needs:
A narrow register to hold the last register accessed (last_reg).
Adders for the short integers used to calculate the decoded registers. 128 registers will use 7 bits.
Decoding can be performed in parallel while the opcode is being decoded. We can even parallelize the decoding of the registers.
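The decoding step can be modelled in software to check the worked examples. The sketch below is a Python model, not a hardware description; the class name `DiffDecoder` and its method names are assumptions made for illustration.

```python
class DiffDecoder:
    """Software model of the decoder: a last_reg latch plus a modulo adder."""

    def __init__(self, reg_n: int, last_reg: int = 0) -> None:
        self.reg_n = reg_n
        self.last_reg = last_reg

    def set_last_reg(self, reg: int) -> None:
        # Models the set_last_reg instruction: overwrite the latch.
        self.last_reg = reg % self.reg_n

    def decode(self, diff: int) -> int:
        # Add the difference to the last accessed register, wrapping around.
        self.last_reg = (self.last_reg + diff) % self.reg_n
        return self.last_reg


decoder = DiffDecoder(reg_n=8, last_reg=3)
print(decoder.decode(1))  # 4
print(decoder.decode(3))  # 7
print(decoder.decode(2))  # 1 (7 + 2 = 9 wraps past register 0)
```

The three calls follow the three decoding examples above, starting from a last register of 3.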

Integration into Register Allocation Given all the compiler techniques we have learned, can we use them with the differential register allocation scheme? The methods are:

Begin with any register allocator → and then apply differential remapping.
Modify a graph coloring register allocator with a differential select mechanism.
Begin with an optimal spilling register allocator and use differential coalesce and differential select.

Adjacency Graph

Adjacency graph → captures the sequence of register accesses; this shows us how we go from one register to another.
It is constructed along with the interference graph.
In the graph each virtual register is a node.
Edge weights represent the frequency of adjacent register accesses.

In the directed graph the v0 to v1 transition occurs twice, so there is a weight of 2 in that direction, and a weight of 1 in the other direction for the one transition the other way. When a transition stays on the same register (for example v1 to v1), it does not have to be shown on the graph.
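To make the construction concrete, here is a minimal Python sketch that builds the weighted, directed adjacency graph from a straight-line access sequence. It is only an illustration: the function name `build_adjacency` is hypothetical, and counting static occurrences in one sequence is an assumption (a compiler would also account for control flow and execution frequency).

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_adjacency(accesses: List[str]) -> Dict[Tuple[str, str], int]:
    """Count directed transitions between consecutive register accesses."""
    weights: Counter = Counter()
    for src, dst in zip(accesses, accesses[1:]):
        if src != dst:  # transitions to the same register are not recorded
            weights[(src, dst)] += 1
    return dict(weights)


# v0 -> v1 occurs twice, v1 -> v0 once, and v1 -> v1 is skipped.
graph = build_adjacency(["v0", "v1", "v1", "v0", "v1"])
print(graph)  # {('v0', 'v1'): 2, ('v1', 'v0'): 1}
```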

Edge Coverage and Cost Calculation

We say an edge is covered if the modulo difference of its two end nodes can be encoded after register allocation. Recall that not all differences can be encoded under the fixed-bit modulo method. Cost = the total weight of the uncovered edges. This is also the number of set_last_reg instructions needed to handle differences that do not fit in the register field.

Example: RegN = 3, DiffN = 2. The register field is then 1 bit wide.

We can only encode differences of 0 and 1. As you can see, there are two places where the difference is -1. For these places we will have to use set_last_reg.

Let’s now look at it as an access graph:

The cost will be 2, for the two uncovered transitions. So we will need 2 set_last_reg() instructions.
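The cost calculation can be sketched in Python as follows. The graph and the register assignment in the usage example are hypothetical stand-ins chosen to give a cost of 2 (the figure's actual graph is not reproduced here), and the name `coverage_cost` is an assumption.

```python
from typing import Dict, Tuple


def coverage_cost(adjacency: Dict[Tuple[str, str], int],
                  assignment: Dict[str, int],
                  reg_n: int, diff_n: int) -> int:
    """Sum the weights of edges whose modulo difference cannot be encoded."""
    cost = 0
    for (src, dst), weight in adjacency.items():
        diff = (assignment[dst] - assignment[src]) % reg_n
        if diff >= diff_n:  # uncovered edge: needs set_last_reg
            cost += weight
    return cost


# RegN = 3, DiffN = 2: only differences 0 and 1 are encodable.
adjacency = {("a", "b"): 1, ("c", "b"): 1, ("b", "a"): 1}
assignment = {"a": 0, "b": 1, "c": 2}
print(coverage_cost(adjacency, assignment, reg_n=3, diff_n=2))  # 2
```

Here the edges c → b and b → a each have a raw difference of -1, which becomes 2 modulo 3 and cannot be encoded in one bit, so each contributes its weight to the cost.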

Differential Remapping

Assign registers to nodes to cover edge weights maximally on the adjacency graph. Perform global remapping (a permutation of the register numbers) after register allocation is done (postpass) to maximize weighted coverage.

Example

In this example, we are using only one register. When we try to use differential encoding we find we can't encode either transition, so we have a cost of 2. The worst-case transition sequence is R2 → R1 → R0. Can remapping help? When we do remapping, let's keep R0 the same and change the other two registers:

When we do this, we get a cost of 0.

In this case: Global remapping preserves semantics if interference
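The remapping example can be checked numerically. The Python sketch below assumes the access sequence is R2 → R1 → R0 with RegN = 3 and DiffN = 2 (the values from the earlier 1-bit example) and that the remapping swaps R1 and R2 while keeping R0. The helper name `sequence_cost` is hypothetical, and the first access is not charged because no previous register is assumed.

```python
from typing import Dict, List


def sequence_cost(seq: List[int], reg_n: int, diff_n: int) -> int:
    """Count transitions whose modulo difference does not fit in the field."""
    return sum(1 for prev, cur in zip(seq, seq[1:])
               if (cur - prev) % reg_n >= diff_n)


original = [2, 1, 0]                 # R2 -> R1 -> R0
remap: Dict[int, int] = {0: 0, 1: 2, 2: 1}   # keep R0, swap R1 and R2
remapped = [remap[r] for r in original]      # R1 -> R2 -> R0

print(sequence_cost(original, reg_n=3, diff_n=2))  # 2
print(sequence_cost(remapped, reg_n=3, diff_n=2))  # 0
```

After the swap both transitions have a modulo difference of 1, so no set_last_reg instructions are needed.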

Differential Remapping Algorithm

Begin with the given information:

Step 1: RV = the initial register vector from the code after register allocation. Cost = the cost of RV, found from the adjacency graph.
Step 2: Find the pair of elements in RV whose swap gives the maximal cost reduction. Different swaps will generate different costs.
Step 3: Did the swap produce a cost reduction? If yes, keep the swap and repeat from Step 2. Eventually no swap reduces the cost any further, and we have reached the maximum cost reduction this search can find.
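The three steps can be read as a greedy pairwise-swap search. The following Python sketch is one possible reading, not the authors' implementation: the names `remap_cost` and `differential_remap`, the dictionary representation of RV (old register number to new register number), and the stopping rule (stop when no swap lowers the cost) are assumptions.

```python
from itertools import combinations
from typing import Dict, Tuple

Edges = Dict[Tuple[int, int], int]


def remap_cost(adjacency: Edges, rv: Dict[int, int],
               reg_n: int, diff_n: int) -> int:
    """Weight of edges left uncovered under the register vector rv."""
    return sum(w for (src, dst), w in adjacency.items()
               if (rv[dst] - rv[src]) % reg_n >= diff_n)


def differential_remap(adjacency: Edges, rv: Dict[int, int],
                       reg_n: int, diff_n: int) -> Tuple[Dict[int, int], int]:
    """Greedy search: repeatedly apply the swap with the largest cost reduction."""
    rv = dict(rv)
    best = remap_cost(adjacency, rv, reg_n, diff_n)      # Step 1
    while True:
        best_swap = None
        for a, b in combinations(sorted(rv), 2):        # Step 2
            rv[a], rv[b] = rv[b], rv[a]                  # try the swap
            cost = remap_cost(adjacency, rv, reg_n, diff_n)
            rv[a], rv[b] = rv[b], rv[a]                  # undo it
            if cost < best:
                best, best_swap = cost, (a, b)
        if best_swap is None:                            # Step 3: no gain left
            return rv, best
        a, b = best_swap
        rv[a], rv[b] = rv[b], rv[a]                      # keep the best swap


# Edges for R2 -> R1 -> R0 with RegN = 3, DiffN = 2; start from the identity.
edges = {(2, 1): 1, (1, 0): 1}
new_rv, new_cost = differential_remap(edges, {0: 0, 1: 1, 2: 2}, 3, 2)
print(new_cost)  # 0
```

Because the search is greedy, it can stop at a local minimum. It is not guaranteed to find the best permutation in general.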

Differential Select

How can we modify the select algorithm to make it differential?

1. Pop a virtual register from the stack.
2. Insert it into both the interference graph and the adjacency graph.
3. Recover the edges to the neighbors that have been inserted prior to this node.
4. Check the interference graph to get the register numbers (colors) that are not used by its neighbors. If no register number (color) is available, the register allocator will handle the node as a spill.
5. For each allowed register number, find the extra cost that will be incurred if the node is assigned this register number.
6. Pick the one with the minimal cost.

The result: among the available colors, the one that covers the maximum edge weight in the adjacency graph is the one that is picked.

Differential Coalesce

Based on an optimal spilling register allocator with 2 stages (a famous algorithm → Appel and George, PLDI 2001). It generates spills optimally and coalesces move instructions aggressively. (Coalescing is merging two subranges to avoid register-to-register moves.) This algorithm was used to see if differential register allocation can be applied.

More registers are exposed and used → so spills are significantly reduced in stage 1 (optimal spills stage)

We need to tackle both move and set_last_reg instructions in stage 2 (the move coalescing stage). This makes it more sensitive to the set_last_reg instructions that are being inserted.

Integrate differential select into stage 2 to search for a solution which reduces set_last_reg instructions as well.
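The color choice made by differential select (steps 4 to 6 of the list given earlier) can be sketched in Python. This is an illustrative sketch only: the function name `differential_select`, the dictionary-based graphs, and the convention of returning `None` for a spill are assumptions, and the stack handling of steps 1 to 3 is left to the caller.

```python
from typing import Dict, Optional, Set, Tuple


def differential_select(node: str,
                        interference: Dict[str, Set[str]],
                        adjacency: Dict[Tuple[str, str], int],
                        assignment: Dict[str, int],
                        reg_n: int, diff_n: int) -> Optional[int]:
    """Pick the free color that leaves the least adjacency weight uncovered."""
    # Step 4: colors not used by already-colored interference neighbors.
    used = {assignment[n] for n in interference.get(node, set())
            if n in assignment}
    candidates = [r for r in range(reg_n) if r not in used]
    if not candidates:
        return None  # no color available: the allocator treats it as a spill

    def extra_cost(reg: int) -> int:
        # Step 5: weight of edges to colored neighbors that reg would not cover.
        total = 0
        for (src, dst), weight in adjacency.items():
            if src == node and dst in assignment:
                if (assignment[dst] - reg) % reg_n >= diff_n:
                    total += weight
            elif dst == node and src in assignment:
                if (reg - assignment[src]) % reg_n >= diff_n:
                    total += weight
        return total

    best = min(candidates, key=extra_cost)  # Step 6
    assignment[node] = best
    return best


interference = {"v2": {"v0"}}
adjacency = {("v1", "v2"): 3, ("v2", "v0"): 1}
colors = {"v0": 0, "v1": 2}
print(differential_select("v2", interference, adjacency, colors, 4, 2))  # 3
```

In this hypothetical example register 0 is excluded by interference. Registers 1 and 2 would leave weights of 4 and 1 uncovered, while register 3 covers both edges.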

Applications What other applications can benefit from diff reg alloc?

Software pipelining: register pressure increases due to optimizations like “unroll and jam”, because a large number of variables are live at the same time. Register pressure has been a bottleneck in software pipelining. In this case we apply differential remapping after the original register allocation is done.

Selectively enable differential encoding: even if a processor has an ample number of registers, turn differential register allocation on in high register pressure regions and off elsewhere, according to register pressure.

Evaluation Through Processor Simulation

Using a low-end processor with tight encoding space: this requires modifying a processor, so simulation is used. The first processor used:

5-stage in-order issue ARM/THUMB machine
16 physical registers and 3 bits in the register field
tested 10 benchmarks from MiBench (a well-known benchmark suite)
modified the GCC compiler
put differential register allocation on top of iterated register allocation and optimal spilling register allocation

The second processor used:

high-performance processor
VLIW machine model with a 5-bit register field and 64 physical registers
a large number of loops from the SPEC benchmarks were executed

Overall, differential register allocation resulted in better performance with trivial hardware costs.
