DAT105: Computer Architecture, Study Period 2, 2009
Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation
Mafijul Islam, Department of Computer Science and Engineering
November 19, 2009
Goals: To understand
• basic pipeline scheduling and loop unrolling
• the impact of control dependency on performance
• register renaming and dynamic scheduling
Case Studies/Assignments:
• Assignment 2 of the Exam on 2007-12-20
• Assignments 2, 3 of the Exam on 2005-12-12
• Assignment 3 of the Exam on 2008-12-18
• Assignment 3 of the Exam on 2006-12-22
Exam 2007-12-20: Assignment 2

LOOP: LD    R4, 0(R1)
      DMUL  R5, R4, R4
      SD    R5, 0(R1)
      DADDI R1, R1, 8
      BNE   R1, R2, LOOP
      NOP

Assume:
• MIPS processor with a 5-stage pipeline (presented in Appendix A of the textbook)
• All memory accesses complete in a single cycle
• There is one branch delay slot
• Multiply operations are fully pipelined like all other arithmetic instructions, but the result is not available until the end of the Memory access stage
for(int i=0; i
Exam 2007-12-20: Assignment 2(A)
• How many cycles does this code take to execute per loop iteration of the original program?
• Specify all types of dependencies and unresolved hazards in this code

LOOP: LD    R4, 0(R1)
      DMUL  R5, R4, R4      # RAW dependency on R4 (written by LD)
      SD    R5, 0(R1)       # RAW dependency on R5 (written by DMUL)
      DADDI R1, R1, 8
      BNE   R1, R2, LOOP    # RAW dependency on R1 (written by DADDI)
      NOP

• Each of the three data hazards causes one stall cycle
• One iteration takes 9 cycles (6 instructions + 3 stall cycles)
Exam 2007-12-20: Assignment 2(B)
• Modify the code to require as few clock cycles as possible
• How many clock cycles does it take now?

Step 1: move DADDI up, between LD and DMUL

LOOP: LD    R4, 0(R1)
      DADDI R1, R1, 8
      DMUL  R5, R4, R4
      SD    R5, 0(R1)
      BNE   R1, R2, LOOP
      NOP

Step 2: move SD into the branch delay slot, replacing the NOP

LOOP: LD    R4, 0(R1)
      DADDI R1, R1, 8
      DMUL  R5, R4, R4
      BNE   R1, R2, LOOP
      SD    R5, 0(R1)

Step 3: change the SD offset to -8, because R1 has already been incremented

LOOP: LD    R4, 0(R1)
      DADDI R1, R1, 8
      DMUL  R5, R4, R4
      BNE   R1, R2, LOOP
      SD    R5, -8(R1)

• Complete elimination of the stall cycles
• One iteration now takes 5 cycles
Exam 2007-12-20: Assignment 2(C)
• Try to modify the code further (n is always an even number)
• How many clock cycles does it take now?
• Specify any remaining hazard

Code from 2(B):

LOOP: LD    R4, 0(R1)
      DADDI R1, R1, 8
      DMUL  R5, R4, R4
      BNE   R1, R2, LOOP
      SD    R5, -8(R1)

• Unroll the loop once, so that each iteration does the work of two previous iterations
• Merge the two DADDIs
• Rearrange further to avoid stall cycles

Unrolled loop:

LOOP: LD    R4, 0(R1)
      DADDI R1, R1, 16
      DMUL  R5, R4, R4
      LD    R4, -8(R1)
      SD    R5, -16(R1)
      DMUL  R5, R4, R4
      BNE   R1, R2, LOOP
      SD    R5, -8(R1)

• Complete elimination of stall cycles
• One new iteration takes 8 cycles
• The instructions corresponding to the old loop body now take 4 cycles
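The cycle counts for parts (A), (B) and (C) can be cross-checked with a small script. This is only a sketch under the stated assumptions, not a pipeline simulator: the `gap` rule (one bubble after LD or DMUL, one bubble when BNE reads a register written by the previous ALU instruction) and the tuple encoding of instructions are hypothetical simplifications chosen to match the slides.

```python
# Minimal stall-counting sketch (assumed model, not a full pipeline simulator).
# Each instruction is (opcode, destination register or None, source registers).

def gap(producer_op, consumer_op):
    """Bubbles needed between a producer and a dependent consumer (assumption)."""
    if consumer_op == "BNE":  # branch reads its operands early, in ID
        return 2 if producer_op in ("LD", "DMUL") else 1
    # LD and DMUL results are available only at the end of the MEM stage
    return 1 if producer_op in ("LD", "DMUL") else 0

def cycles_per_iteration(code):
    issue = []        # issue cycle of each instruction
    last_writer = {}  # register -> index of the latest instruction writing it
    for i, (op, dest, srcs) in enumerate(code):
        t = issue[-1] + 1 if issue else 1
        for reg in srcs:
            if reg in last_writer:
                p = last_writer[reg]
                t = max(t, issue[p] + 1 + gap(code[p][0], op))
        issue.append(t)
        if dest:
            last_writer[dest] = i
    return issue[-1]

ORIGINAL = [
    ("LD", "R4", ["R1"]), ("DMUL", "R5", ["R4", "R4"]), ("SD", None, ["R5", "R1"]),
    ("DADDI", "R1", ["R1"]), ("BNE", None, ["R1", "R2"]), ("NOP", None, []),
]
SCHEDULED = [
    ("LD", "R4", ["R1"]), ("DADDI", "R1", ["R1"]), ("DMUL", "R5", ["R4", "R4"]),
    ("BNE", None, ["R1", "R2"]), ("SD", None, ["R5", "R1"]),
]
UNROLLED = [
    ("LD", "R4", ["R1"]), ("DADDI", "R1", ["R1"]), ("DMUL", "R5", ["R4", "R4"]),
    ("LD", "R4", ["R1"]), ("SD", None, ["R5", "R1"]), ("DMUL", "R5", ["R4", "R4"]),
    ("BNE", None, ["R1", "R2"]), ("SD", None, ["R5", "R1"]),
]

for name, code in [("original", ORIGINAL), ("scheduled", SCHEDULED), ("unrolled", UNROLLED)]:
    print(name, cycles_per_iteration(code))
```

The model only looks at dependences inside one iteration, which is enough here because no value produced near the end of an iteration is consumed at the start of the next one.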
Exam 2005-12-12: Assignment 3
Assume that a floating-point load, division, and multiplication take 10, 50, and 40 cycles, respectively.

      LF   F1, 0(R1)
      DIVF F4, F2, F1
      MULT F2, F1, F0

Compute the execution time of the above sequence on a processor that can issue:
• one instruction per cycle and that has no register renaming capability
• three instructions per cycle and that has no register renaming capability
• three instructions per cycle and that has register renaming capability

Disclaimer: If you feel that more assumptions have to be made, feel free to do so. If they are needed and reasonable, they will be accepted without any deduction on the score.
Exam 2005-12-12: Assignment 3(i)
Compute the execution time on a processor that can issue one instruction per cycle and that has no register renaming capability.

      LF   F1, 0(R1)      # 10 cycles
      DIVF F4, F2, F1     # 50 cycles
      MULT F2, F1, F0     # 40 cycles

Identify the dependences:
• True data dependence: DIVF and MULT both read F1, which LF writes
• Name dependence: MULT writes F2, which DIVF reads

Execution Time = 10 + 50 + 40 cycles = 100 cycles
Exam 2005-12-12: Assignment 3(ii)
Compute the execution time on a processor that can issue three instructions per cycle and that has no register renaming capability.

      LF   F1, 0(R1)      # 10 cycles
      DIVF F4, F2, F1     # 50 cycles
      MULT F2, F1, F0     # 40 cycles

Identify the dependences:
• True data dependence (on F1)
• Name dependence (on F2)

Without renaming, the name dependence on F2 is assumed to keep MULT behind DIVF, so the three instructions still execute one after another.

Execution Time = 10 + 50 + 40 cycles = 100 cycles
Exam 2005-12-12: Assignment 3(iii)
Compute the execution time on a processor that can issue three instructions per cycle and that has register renaming capability.

      LF   F1, 0(R1)      # 10 cycles
      DIVF F4, F2, F1     # 50 cycles
      MULT F2, F1, F0     # 40 cycles

Identify the dependences:
• True data dependence
• Name dependence

Apply register renaming:

      LF   F1, 0(R1)      # 10 cycles
      DIVF F4, F2, F1     # 50 cycles
      MULT S, F1, F0      # 40 cycles

Cycle issued   Instruction
1              LF   F1, 0(R1)
1              DIVF F4, F2, F1
1              MULT S, F1, F0

Execution Time = 10 + MAX(50, 40) cycles = 60 cycles
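The three execution times can be reproduced with a short dependence-chain calculation. This is a rough sketch, assuming that an instruction starts as soon as every instruction it depends on has finished, and that without renaming the name dependence on F2 makes MULT wait for DIVF; the function name `execution_time` and the dictionaries are illustrative, not part of the exam.

```python
# Latencies from the assignment, listed in program order.
LATENCY = {"LF": 10, "DIVF": 50, "MULT": 40}

def execution_time(latency, deps):
    """Finish time of the last instruction, given 'waits for' lists per instruction."""
    finish = {}
    for name in latency:  # dicts keep insertion order, i.e. program order
        start = max((finish[d] for d in deps.get(name, [])), default=0)
        finish[name] = start + latency[name]
    return max(finish.values())

# Cases (i) and (ii): true dependence on F1 plus the name dependence on F2,
# modelled (as an assumption) as MULT waiting for DIVF.
NO_RENAMING = {"DIVF": ["LF"], "MULT": ["LF", "DIVF"]}

# Case (iii): renaming F2 to S leaves only the true dependences on F1.
WITH_RENAMING = {"DIVF": ["LF"], "MULT": ["LF"]}

print(execution_time(LATENCY, NO_RENAMING))    # 100
print(execution_time(LATENCY, WITH_RENAMING))  # 60
```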
Exam 2005-12-12: Assignment 2(A)
Evaluating the impact of branch predictions on performance:

Given:
• every fifth instruction is a conditional branch
• all conditional branches can be predicted with 100% accuracy, except for the “branch less than” category that can be predicted with only 50% accuracy
• CPI=1 for all instructions including the branches that are correctly predicted
• misprediction penalty is 10 cycles

What is the CPI for the integer applications?
Instruction mix:
• conditional branches: 20%
• “less than”: 35% of the conditional branches (20%) for integer applications

Relative occurrence of mispredicted “less than” branches = 0.20 * 0.35 * 0.5 = 0.035

Misprediction penalty: 10 cycles

CPI: 1 for the 80% of instructions that are not branches, and 1 for the correctly predicted branches

CPIoverall = 1 * (1 - 0.035) + 10 * 0.035 = 1.315
• 1 * (1 - 0.035): correctly predicted branches + other instructions
• 10 * 0.035: mispredicted branches
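The arithmetic above can be replayed in a few lines. This sketch follows the slide's accounting, in which a mispredicted branch is charged 10 cycles in total; the variable names are illustrative only.

```python
# Figures taken from the assignment text.
branch_fraction = 0.20   # every fifth instruction is a conditional branch
less_than_share = 0.35   # share of "less than" among conditional branches
mispredict_rate = 0.50   # "less than" branches are predicted with 50% accuracy
penalty_cycles = 10      # cycles charged to a mispredicted branch (slide's accounting)

# Fraction of all instructions that are mispredicted "less than" branches.
mispredicted = branch_fraction * less_than_share * mispredict_rate

# Everything else (other instructions and correctly predicted branches) has CPI 1.
cpi_overall = 1 * (1 - mispredicted) + penalty_cycles * mispredicted

print(round(mispredicted, 3), round(cpi_overall, 3))
```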
Exam 2008-12-18: Assignment 3(A)
• The following MIPS program operates on an array with 64-bit elements.
• The register R1 initially points to the beginning of the array.
• The register R2 points to the end.
• The array always contains 1000 elements.

      ANDI  R3, R3, 0
LOOP: LD    R4, 0(R1)
      DMUL  R5, R4, R4
      DADD  R5, R3, R5
      SD    R5, 0(R1)
      DADDI R3, R4, 0
      DADDI R1, R1, 8
      BNE   R1, R2, LOOP

How does a 2-bit branch prediction scheme work for the given code?
For our example program we would have one entry corresponding to the BNE instruction at the end of the program. The prediction evolves as follows if we assume we start in state “00”:

Execution of BNE     State before execution   Prediction          State after execution
First                00                       Not taken (wrong)   01 (taken)
Second               01                       Not taken (wrong)   11 (taken)
3rd, 4th, …, 999th   11                       Taken (correct)     11 (taken)
1000th               11                       Taken (wrong)       10 (not taken)
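The table can be replayed with a small state machine. This is a sketch under assumptions: the transitions visible in the table (00 to 01, 01 to 11, 11 to 11 on a taken branch, 11 to 10 on a not-taken branch) come from the slides, while the remaining entries of `NEXT_STATE` are filled in from the usual textbook state diagram and are not exercised by this program.

```python
# 2-bit predictor sketch: states "10" and "11" predict taken,
# states "00" and "01" predict not taken.
NEXT_STATE = {
    ("00", True): "01", ("00", False): "00",   # (state, branch taken?) -> next state
    ("01", True): "11", ("01", False): "00",
    ("10", True): "11", ("10", False): "00",
    ("11", True): "11", ("11", False): "10",
}

def simulate(outcomes, state="00"):
    """Return (number of mispredictions, final state) for a list of branch outcomes."""
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state in ("10", "11")
        if predicted_taken != taken:
            mispredictions += 1
        state = NEXT_STATE[(state, taken)]
    return mispredictions, state

# 1000 array elements: BNE is taken 999 times and falls through once.
outcomes = [True] * 999 + [False]
mispredictions, final_state = simulate(outcomes)
print(mispredictions, final_state)
```

Under these assumptions the branch is mispredicted three times (the first, second and 1000th executions), which matches the rows marked "wrong" in the table.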
Exam 2006-12-22: Assignment 3
Dynamic scheduling: Tomasulo’s Algorithm

      SUB F0, F1, F2      (8 cycles)
      DIV F3, F0, F4      (10 cycles)
      ADD F4, F5, F6      (6 cycles)

3(A): Show all data and name dependences in the code
3(B): Establish when the second addition instruction can start its execution. Assume that
• there are two addition functional units and a division functional unit
• a single instruction is issued every cycle
Exam 2006-12-22: Assignment 3(A)
Show all data and name dependences in the code

      SUB F0, F1, F2      (8 cycles)
      DIV F3, F0, F4      (10 cycles)
      ADD F4, F5, F6      (6 cycles)

• True data dependence: DIV reads F0, which SUB writes
• Name dependence: ADD writes F4, which DIV reads
Exam 2006-12-22: Assignment 3(B)
Establish when the second addition instruction can start its execution:

Clock Cycle: 0

Instruction status
Instruction      Issue  Execute  Write Result
SUB F0, F1, F2
DIV F3, F0, F4
ADD F4, F5, F6

Reservation stations
Name  Busy  Op   Vj        Vk        Qj    Qk  A
Add1  No
Add2  No
Div   No

Register status
Field  F0    F1  F2  F3   F4    F5  F6
Qi
Clock Cycle: 1

Instruction status
Instruction      Issue  Execute  Write Result
SUB F0, F1, F2   Yes
DIV F3, F0, F4   No
ADD F4, F5, F6   No

Reservation stations
Name  Busy  Op   Vj        Vk        Qj    Qk  A
Add1  Yes   SUB  Regs[F1]  Regs[F2]
Add2  No
Div   No

Register status
Field  F0    F1  F2  F3   F4    F5  F6
Qi     Add1
Clock Cycle: 2

Instruction status
Instruction      Issue  Execute  Write Result
SUB F0, F1, F2   Yes    Yes
DIV F3, F0, F4   Yes
ADD F4, F5, F6   No

Reservation stations
Name  Busy  Op   Vj        Vk        Qj    Qk  A
Add1  Yes   SUB  Regs[F1]  Regs[F2]
Add2  No
Div   Yes   DIV            Regs[F4]  Add1

Register status
Field  F0    F1  F2  F3   F4    F5  F6
Qi     Add1          Div
Clock Cycle: 3

Instruction status
Instruction      Issue  Execute  Write Result
SUB F0, F1, F2   Yes    Yes
DIV F3, F0, F4   Yes
ADD F4, F5, F6   Yes

Reservation stations
Name  Busy  Op   Vj        Vk        Qj    Qk  A
Add1  Yes   SUB  Regs[F1]  Regs[F2]
Add2  Yes   ADD  Regs[F5]  Regs[F6]
Div   Yes   DIV            Regs[F4]  Add1

Register status
Field  F0    F1  F2  F3   F4    F5  F6
Qi     Add1          Div  Add2
Clock Cycle: 4

Instruction status
Instruction      Issue  Execute  Write Result
SUB F0, F1, F2   Yes    Yes
DIV F3, F0, F4   Yes
ADD F4, F5, F6   Yes    Yes

Reservation stations
Name  Busy  Op   Vj        Vk        Qj    Qk  A
Add1  Yes   SUB  Regs[F1]  Regs[F2]
Add2  Yes   ADD  Regs[F5]  Regs[F6]
Div   Yes   DIV            Regs[F4]  Add1

Register status
Field  F0    F1  F2  F3   F4    F5  F6
Qi     Add1          Div  Add2
Clock Cycle: 10

Instruction status
Instruction      Issue  Execute  Write Result
SUB F0, F1, F2   Yes    Yes      Yes
DIV F3, F0, F4   Yes
ADD F4, F5, F6   Yes    Yes      Yes

Reservation stations
Name  Busy  Op   Vj        Vk        Qj    Qk  A
Add1  No    SUB  Regs[F1]  Regs[F2]
Add2  No    ADD  Regs[F5]  Regs[F6]
Div   Yes   DIV            Regs[F4]  Add1

Register status
Field  F0    F1  F2  F3   F4    F5  F6
Qi     Add1          Div  Add2
Clock Cycle: 11

Instruction status
Instruction      Issue  Execute  Write Result
SUB F0, F1, F2   Yes    Yes      Yes
DIV F3, F0, F4   Yes    Yes
ADD F4, F5, F6   Yes    Yes      Yes

Reservation stations
Name  Busy  Op   Vj        Vk        Qj    Qk  A
Add1  No    SUB  Regs[F1]  Regs[F2]
Add2  No    ADD  Regs[F5]  Regs[F6]
Div   Yes   DIV            Regs[F4]  Add1

Register status
Field  F0    F1  F2  F3   F4    F5  F6
Qi     -             Div  -
Clock Cycle: 21

Instruction status
Instruction      Issue  Execute  Write Result
SUB F0, F1, F2   Yes    Yes      Yes
DIV F3, F0, F4   Yes    Yes      Yes
ADD F4, F5, F6   Yes    Yes      Yes

Reservation stations
Name  Busy  Op   Vj        Vk        Qj    Qk  A
Add1  No    SUB  Regs[F1]  Regs[F2]
Add2  No    ADD  Regs[F5]  Regs[F6]
Div   No    DIV            Regs[F4]  Add1

Register status
Field  F0    F1  F2  F3   F4    F5  F6
Qi     -             Div  -
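The clock cycles shown in the tables (1, 2, 3, 4, 10, 11, 21) can be derived with a short timing sketch. It is not a full Tomasulo implementation: it assumes one instruction issues per cycle, a free reservation station is always available, execution starts in the cycle after issue or after the last missing operand is written, the result is written in the cycle after execution finishes, and there is no contention for the common data bus. The function `schedule` and its tuple format are hypothetical.

```python
def schedule(program):
    """program: list of (name, dest, sources, latency) in issue order.
    Returns {name: (issue cycle, first execute cycle, write-result cycle)}."""
    pending_write = {}  # register -> cycle in which its pending producer writes
    timeline = {}
    for issue_cycle, (name, dest, srcs, latency) in enumerate(program, start=1):
        # Operands are looked up at issue time, so a later write to a source
        # register (the name dependence on F4) cannot delay this instruction.
        ready = max([pending_write.get(r, 0) for r in srcs] + [issue_cycle])
        start = ready + 1
        write = start + latency  # execution occupies cycles start .. start+latency-1
        timeline[name] = (issue_cycle, start, write)
        pending_write[dest] = write
    return timeline

PROGRAM = [
    ("SUB", "F0", ["F1", "F2"], 8),
    ("DIV", "F3", ["F0", "F4"], 10),
    ("ADD", "F4", ["F5", "F6"], 6),
]

timeline = schedule(PROGRAM)
for name, (issue, start, write) in timeline.items():
    print(name, "issue", issue, "execute from", start, "write result", write)
```

Under these assumptions the ADD instruction starts executing in cycle 4, without waiting for DIV, while DIV has to wait until SUB writes F0 in cycle 10.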