CIS 662 – Computer Architecture – Fall 2004 - Class 16 – 11/09/04
Compiler Techniques for ILP
So far we have explored dynamic hardware techniques for ILP exploitation:
- BTB and branch prediction
- Dynamic scheduling
  - Scoreboard
  - Tomasulo's algorithm
- Speculation
- Multiple issue
How can compilers help?
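Loop Unrolling
Let's look at the code:

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

          ADD    R2, R0, R0       ; R2 = 0
    Loop: L.D    F0, 0(R1)        ; load x[i]
          ADD.D  F4, F0, F2       ; add s (held in F2)
          S.D    F4, 0(R1)        ; store x[i]
          DADDUI R1, R1, #-8      ; step to the previous 8-byte element
          BNE    R1, R2, Loop     ; repeat until R1 equals R2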
Scheduling On A Simple 5 Stage MIPS

    Loop: L.D    F0, 0(R1)
          stall                   ; wait for F0 value to propagate
          ADD.D  F4, F0, F2
          stall                   ; wait for FP add to be completed
          stall                   ; wait for FP add to be completed
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          stall                   ; wait for R1 value to propagate
          BNE    R1, R2, Loop
          stall                   ; one cycle, branch penalty

10 cycles
We Could Rearrange The Instructions
Interleave the stalled instructions with some independent instructions. The best we can achieve is 6 cycles:

    Loop: L.D    F0, 0(R1)
          DADDUI R1, R1, #-8      ; fills the load delay
          ADD.D  F4, F0, F2
          stall                   ; wait for FP add to be completed
          BNE    R1, R2, Loop
          S.D    F4, 8(R1)        ; branch delay slot; displacement is 8 because R1 is already decremented

6 cycles
Loop Unrolling
Get more useful instructions into the loop and reduce overhead.
Step 1: Put several iterations together

Original loop (assume the branch is taken):

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

Four iterations placed back to back:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
Loop Unrolling
Step 2: Take out control instructions, adjust offsets

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F0, -8(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -8(R1)
          L.D    F0, -16(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -16(R1)
          L.D    F0, -24(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop
Loop Unrolling
Step 3: Rename registers

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop
Loop Unrolling
The current loop still has stalls due to RAW dependencies. Each of the four L.D / ADD.D / S.D groups stalls just as the original loop body did:

          L.D    F0, 0(R1)
          stall                   ; wait for F0 value to propagate
          ADD.D  F4, F0, F2
          stall                   ; wait for FP add to be completed
          stall                   ; wait for FP add to be completed
          S.D    F4, 0(R1)

and so does the loop overhead:

          DADDUI R1, R1, #-32
          stall                   ; wait for R1 value to propagate
          BNE    R1, R2, Loop
          stall                   ; one cycle, branch penalty

4 x 6 + 4 = 28 cycles = 7 per iteration
Loop Unrolling
Step 4: Interleave iterations

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDUI R1, R1, #-32
          S.D    F12, 16(R1)
          BNE    R1, R2, Loop
          S.D    F16, 8(R1)

14 cycles = 3.5 per iteration
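
At the source level, unrolling four times corresponds roughly to the C sketch below. The function names `add_scalar` and `add_scalar_unrolled4` are illustrative and not from the slides. The unrolled version assumes the element count is a multiple of 4 (true for the 1000 elements used here), so it needs no cleanup loop.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: x[i] = x[i] + s for i = n down to 1 (x[0] is unused). */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Unrolled four times: one index update and one loop test per four elements.
   Assumption: n is a multiple of 4. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(n % 4 == 0);
    for (size_t i = n; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

The four statements in the unrolled body touch different elements, which is what lets the scheduler interleave their loads, adds and stores.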
Loop Unrolling + Multiple Issue
Let's unroll the loop 5 times and mark integer and FP operations

    Loop: L.D    F0, 0(R1)        ; int
          ADD.D  F4, F0, F2       ; FP
          S.D    F4, 0(R1)        ; int
          L.D    F6, -8(R1)       ; int
          ADD.D  F8, F6, F2       ; FP
          S.D    F8, -8(R1)       ; int
          L.D    F10, -16(R1)     ; int
          ADD.D  F12, F10, F2     ; FP
          S.D    F12, -16(R1)     ; int
          L.D    F14, -24(R1)     ; int
          ADD.D  F16, F14, F2     ; FP
          S.D    F16, -24(R1)     ; int
          L.D    F18, -32(R1)     ; int
          ADD.D  F20, F18, F2     ; FP
          S.D    F20, -32(R1)     ; int
          DADDUI R1, R1, #-40     ; int
          BNE    R1, R2, Loop     ; int
Loop Unrolling + Multiple Issue
Move all loads first, then ADD.D, then S.D

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          S.D    F16, -24(R1)
          S.D    F20, -32(R1)
          DADDUI R1, R1, #-40
          BNE    R1, R2, Loop
Loop Unrolling + Multiple Issue
Rearrange instructions to handle the delay for DADDUI and BNE. The last two stores move after DADDUI and BNE; their displacements still need adjusting:

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          DADDUI R1, R1, #-40
          S.D    F16, -24(R1)
          BNE    R1, R2, Loop
          S.D    F20, -32(R1)
Loop Unrolling + Multiple Issue
Fix immediate displacement values

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          DADDUI R1, R1, #-40
          S.D    F16, 16(R1)      ; -24 + 40
          BNE    R1, R2, Loop
          S.D    F20, 8(R1)       ; -32 + 40
Loop Unrolling + Multiple Issue
Now imagine we can issue 2 instructions per cycle, one integer (load, store, integer ALU or branch) and one FP.
The loop is the same as on the previous slide. Its 12 integer-side instructions (5 L.D, 5 S.D, DADDUI, BNE) each take one issue cycle, and the 5 ADD.D instructions issue alongside them in the FP slot.
12 cycles = 2.4 per iteration
Static Branch Prediction
Analyze the code and figure out which outcome of a branch is likely:
- Always predict taken
- Predict backward branches as taken, forward as not taken
- Predict based on the profile of previous runs
Static branch prediction can help us schedule delayed branch slots
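
The second and third schemes can be sketched as simple decision functions. This is a minimal illustration, not a compiler's actual implementation. The names `predict_backward_taken` and `predict_from_profile` are hypothetical, and predicting taken on a tie in the profile is an arbitrary choice.

```c
#include <stdbool.h>
#include <stdint.h>

/* Backward taken, forward not taken: a branch whose target is at or below
   its own address jumps backward (typically a loop-closing branch). */
bool predict_backward_taken(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc <= branch_pc;
}

/* Profile-based: predict the direction the branch took most often in
   earlier runs. Assumption: a tie is predicted as taken. */
bool predict_from_profile(unsigned long taken, unsigned long not_taken) {
    return taken >= not_taken;
}
```

For the loop in these slides, BNE targets `Loop` at a lower address, so the backward-taken rule predicts it taken, which is right on every iteration except the last.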
Static Multiple Issue: VLIW
Hardware checking for dependencies in issue packets may be expensive and complex.
The compiler can examine instructions and decide which ones can be scheduled in parallel, grouping them into instruction packets: a very long instruction word (VLIW).
Hardware can then be simplified.
The processor has multiple functional units, and each field of the VLIW is assigned to one unit.
For example, a VLIW could contain 5 fields: one has to contain an ALU instruction or branch, two have to contain FP instructions and two have to be memory references.
Example
Assume the VLIW contains 5 fields: ALU instruction or branch, two FP instructions and two memory references.
Ignore the branch delay slot.

    Loop: L.D    F0, 0(R1)        ; memory reference
          stall                   ; wait for F0 value to propagate
          ADD.D  F4, F0, F2       ; FP instruction
          stall                   ; wait for FP add to be completed
          stall                   ; wait for FP add to be completed
          S.D    F4, 0(R1)        ; memory reference
          DADDUI R1, R1, #-8      ; ALU instruction
          stall                   ; wait for R1 value to propagate
          BNE    R1, R2, Loop     ; ALU instruction (branch)
Example
Unroll seven times and rearrange

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          L.D    F22, -40(R1)
          L.D    F26, -48(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          ADD.D  F24, F22, F2
          ADD.D  F28, F26, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          S.D    F16, -24(R1)
          S.D    F20, -32(R1)
          DADDUI R1, R1, #-56
          S.D    F24, 16(R1)
          BNE    R1, R2, Loop
          S.D    F28, 8(R1)

VLIW fields: ALU/branch | FP | FP | mem | mem
Example
In the final schedule DADDUI issues before the last three stores, so their displacements are adjusted by 56: S.D F20, 24(R1), S.D F24, 16(R1) and S.D F28, 8(R1).
Overall: 9 cycles for 7 iterations = 1.29 cycles per iteration.
But the VLIW was only about half full.
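
The "half full" observation can be checked with a little arithmetic. The sketch below uses hypothetical helper names and counts one L.D, one ADD.D and one S.D per unrolled copy plus DADDUI and BNE, then divides by the number of available slots.

```c
/* Operations in the loop body unrolled `unroll` times:
   L.D + ADD.D + S.D per copy, plus DADDUI and BNE. */
int unrolled_loop_operations(int unroll) {
    return 3 * unroll + 2;
}

/* Fraction of VLIW slots that hold a real operation. */
double vliw_slot_utilization(int operations, int cycles, int slots_per_word) {
    return (double)operations / (double)(cycles * slots_per_word);
}
```

With 7 copies there are 23 operations. Nine 5-field instructions offer 45 slots, so utilization is 23/45, roughly 51%.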
Detecting and Enhancing Loop Level Parallelism
Determine whether data in later iterations depends on data in earlier iterations: a loop-carried dependence.
This is easier to detect at the source code level than in machine code.

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

S1 calculates a value A[i+1] which will be used in the next iteration of S1.
S2 calculates a value B[i+1] which will be used in the next iteration of S2.
These are loop-carried dependences and prevent parallelism.
S1 calculates a value A[i+1] which will be used in the current iteration of S2.
This is a dependence within the loop.
Detecting and Enhancing Loop Level Parallelism

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i+1] = C[i] + D[i];      /* S2 */
    }

S1 calculates a value A[i] which is not used in later iterations.
S2 calculates a value B[i+1] which will be used in the next iteration of S1.
This is a loop-carried dependence, but S1 depends on S2, not on itself, and S2 does not depend on S1.
This loop can be made parallel if we transform it so that there is no loop-carried dependence:

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];            /* S2 */
        A[i+1] = A[i+1] + B[i+1];        /* S1 */
    }
    B[101] = C[100] + D[100];
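
One way to convince yourself that the transformation preserves the result is to run both versions on the same data and compare. In the sketch below the function names `loop_original` and `loop_transformed` are illustrative, and each array is assumed to have at least 102 elements so that index 101 is valid.

```c
#define N 100

/* Original loop: S1 in iteration i+1 reads B[i+1], written by S2 in iteration i. */
void loop_original(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= N; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: S2 and the S1 that consumes its result now sit in the
   same iteration, so no value flows from one iteration to the next. */
void loop_transformed(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i = i + 1) {
        B[i + 1] = C[i] + D[i];            /* S2 */
        A[i + 1] = A[i + 1] + B[i + 1];    /* S1 */
    }
    B[N + 1] = C[N] + D[N];
}
```

Both versions perform exactly the same additions on the same operands, only grouped into iterations differently, so the arrays end up identical.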
Detecting and Enhancing Loop Level Parallelism
A recurrence creates a loop-carried dependence:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i-1] + B[i];
    }

But sometimes the loop may still be parallelizable, if the distance between dependent elements is greater than 1:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i-5] + B[i];
    }
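
With a dependence distance of 5, element i depends only on element i-5, so the iterations split into 5 chains (i = 1, 6, 11, ...; i = 2, 7, 12, ...; and so on) that never touch each other's data. The sketch below makes the chains explicit. The function names are illustrative, and `A` is assumed to point into a larger buffer so that `A[-4]` through `A[100]` are valid.

```c
#define N 100
#define DIST 5

/* Reference version, as in the slide. Assumes A[-4..N] and B[1..N] are valid. */
void recurrence_sequential(double *A, const double *B) {
    for (int i = 1; i <= N; i = i + 1)
        A[i] = A[i - DIST] + B[i];
}

/* Same computation grouped into DIST independent chains. Each pass of the
   outer loop reads and writes only indices congruent to `start` modulo DIST,
   so the passes could run in parallel. */
void recurrence_by_chain(double *A, const double *B) {
    for (int start = 1; start <= DIST; start = start + 1)
        for (int i = start; i <= N; i = i + DIST)
            A[i] = A[i - DIST] + B[i];
}
```

Within one chain the work is still sequential. The parallelism comes from running the 5 chains side by side, for example by unrolling the original loop 5 times.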
Detecting and Enhancing Loop Level Parallelism
Find all dependencies in the following loop (there are 5) and eliminate as many as you can:

    for (i = 1; i <= 100; i = i + 1) {
        Y[i] = X[i] / c;    /* S1 */
        X[i] = X[i] + c;    /* S2 */
        Z[i] = Y[i] + c;    /* S3 */
        Y[i] = c - Y[i];    /* S4 */
    }

Solution on page 325
Code Transformation
Eliminating dependent computations
- Copy propagation

          DADDUI R1, R2, #4
          DADDUI R1, R1, #4

  becomes

          DADDUI R1, R2, #8

- Tree height reduction

          ADD    R1, R2, R3
          ADD    R4, R1, R6
          ADD    R8, R4, R7

  becomes

          ADD    R1, R2, R3       ; these first two
          ADD    R4, R6, R7       ; can be done in parallel
          ADD    R8, R1, R4

    sum = sum + x    /* suppose this is in a loop and we unroll it 5 times */

    sum = sum + x1 + x2 + x3 + x4 + x5
    sum = (sum + x1) + (x2 + x3) + (x4 + x5)

  The three parenthesized sums can be done in parallel; the additions that combine them must be done sequentially.
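
The regrouped sum can be written out in C as a minimal sketch. The function names are illustrative. Integers are used because integer addition can be regrouped freely (barring overflow), whereas regrouping floating-point additions can change rounding, so a compiler may only do this to FP code when that is permitted.

```c
/* Left-to-right evaluation: every addition waits for the one before it,
   a chain of 5 dependent additions. */
long sum_sequential(long sum, const long x[5]) {
    return ((((sum + x[0]) + x[1]) + x[2]) + x[3]) + x[4];
}

/* Tree height reduction: a, b and c are independent and can be computed in
   parallel; only the two combining additions are dependent, so the chain
   is 3 additions long instead of 5. */
long sum_tree(long sum, const long x[5]) {
    long a = sum + x[0];
    long b = x[1] + x[2];
    long c = x[3] + x[4];
    return (a + b) + c;
}
```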
Software Pipelining
Combining instructions from different loop iterations to separate dependent instructions within an iteration
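Software Pipelining
Apply the software pipelining technique to the following loop:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

Look at three consecutive iterations, which work on the elements at R1+16, R1+8 and R1:

    Element at R1+16:   L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
    Element at R1+8:    L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
    Element at R1:      L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)

Take the S.D from the first, the ADD.D from the second and the L.D from the third:

          Startup code
    Loop: S.D    F4, 16(R1)       ; store the result for the element at R1+16
          ADD.D  F4, F0, F2       ; add for the element at R1+8
          L.D    F0, 0(R1)        ; load the element at R1
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
          Cleanup code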
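
The startup and cleanup code are easiest to see at the source level. The C sketch below is an illustration under stated assumptions, not the slides' own code: the function name `add_scalar_swp` and the variables `loaded` and `summed` (standing in for F0 and F4) are hypothetical, and the loop is assumed to have at least 2 elements.

```c
#include <assert.h>

/* Software-pipelined x[i] = x[i] + s for i = n down to 1 (x[0] is unused).
   Each pass of the loop stores element i, adds for element i-1 and loads
   element i-2, so no instruction in the body depends on another one in the
   same pass. Assumption: n >= 2. */
void add_scalar_swp(double *x, int n, double s) {
    assert(n >= 2);

    /* Startup code: fill the pipeline. */
    double loaded = x[n];            /* L.D   for element n     */
    double summed = loaded + s;      /* ADD.D for element n     */
    loaded = x[n - 1];               /* L.D   for element n - 1 */

    /* Steady state. */
    for (int i = n; i > 2; i = i - 1) {
        x[i] = summed;               /* S.D   for element i     */
        summed = loaded + s;         /* ADD.D for element i - 1 */
        loaded = x[i - 2];           /* L.D   for element i - 2 */
    }

    /* Cleanup code: drain the pipeline. */
    x[2] = summed;                   /* S.D   for element 2 */
    summed = loaded + s;             /* ADD.D for element 1 */
    x[1] = summed;                   /* S.D   for element 1 */
}
```

The startup code performs the loads and the add that the first pass of the loop expects to find already done, and the cleanup code finishes the two elements still in flight when the loop exits.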
Software Pipelining vs. Loop Unrolling
Loop unrolling eliminates loop maintenance overhead, exposing parallelism between iterations
- Creates larger code
Software pipelining enables some loop iterations to run at top speed by eliminating RAW hazards that create latencies within an iteration
- Requires more complex transformations
Homework #8
Due Tuesday, November 16, by the end of class
Submit either in class (paper) or by e-mail (PS or PDF only), or bring a paper copy to my office
Do exercises 4.2, 4.6, 4.9 (skip parts d and e) and 4.11