DAT105: Computer Architecture Study Period 2, 2009 Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation Mafijul Islam Department of Computer Science and Engineering November 19, 2009

  • DAT105: Computer Architecture Study Period 2, 2009

    Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation

    Mafijul Islam
    Department of Computer Science and Engineering

    November 19, 2009

  • DAT105: Computer Architecture Study Period 2, 2009

    Goals: To understand
    • basic pipeline scheduling and loop unrolling
    • the impact of control dependency on performance
    • register renaming and dynamic scheduling

    Case Studies/Assignments:
    • Assignment 2 of the Exam on 2007-12-20
    • Assignments 2, 3 of the Exam on 2005-12-12
    • Assignment 3 of the Exam on 2008-12-18
    • Assignment 3 of the Exam on 2006-12-22

  • DAT105: Computer Architecture, Exam 2007-12-20: Assignment 2

    LOOP: LD    R4, 0(R1)
          DMUL  R5, R4, R4
          SD    R5, 0(R1)
          DADDI R1, R1, 8
          BNE   R1, R2, LOOP
          NOP

    Assume:
    • MIPS processor with a 5-stage pipeline (presented in Appendix A of the textbook)
    • All memory accesses complete in a single cycle
    • There is one branch delay slot
    • Multiply operations are fully pipelined like all other arithmetic instructions, but the result is not available until the end of the Memory access stage

    for(int i=0; i

  • DAT105: Computer Architecture, Exam 2007-12-20: Assignment 2(A)

    • How many cycles does this code take to execute per loop iteration of the original program?
    • Specify all types of dependencies and unresolved hazards in this code

    LOOP: LD    R4, 0(R1)
          DMUL  R5, R4, R4      # RAW dependency on R4
          SD    R5, 0(R1)       # RAW dependency on R5
          DADDI R1, R1, 8
          BNE   R1, R2, LOOP    # RAW dependency on R1
          NOP

    One iteration takes 9 cycles: 6 instructions plus 3 stall cycles, because each of the three data hazards causes one stall cycle.

  • DAT105: Computer Architecture, Exam 2007-12-20: Assignment 2(B)

    • Modify the code to require as few clock cycles as possible
    • How many clock cycles does it take now?

    Original code:
    LOOP: LD    R4, 0(R1)
          DMUL  R5, R4, R4
          SD    R5, 0(R1)
          DADDI R1, R1, 8
          BNE   R1, R2, LOOP
          NOP

    Step 1: move DADDI up between LD and DMUL (the SD offset is corrected in step 3):
    LOOP: LD    R4, 0(R1)
          DADDI R1, R1, 8
          DMUL  R5, R4, R4
          SD    R5, 0(R1)
          BNE   R1, R2, LOOP
          NOP

    Step 2: move SD into the branch delay slot, replacing the NOP:
    LOOP: LD    R4, 0(R1)
          DADDI R1, R1, 8
          DMUL  R5, R4, R4
          BNE   R1, R2, LOOP
          SD    R5, 0(R1)

    Step 3: change the SD offset to -8, because R1 has already been incremented:
    LOOP: LD    R4, 0(R1)
          DADDI R1, R1, 8
          DMUL  R5, R4, R4
          BNE   R1, R2, LOOP
          SD    R5, -8(R1)

    All stall cycles are eliminated; one iteration now takes 5 cycles.

  • DAT105: Computer Architecture, Exam 2007-12-20: Assignment 2(C)

    • Try to modify the code further (n is always an even number)
    • How many clock cycles does it take now?
    • Specify any remaining hazard

    Code from 2(B):
    LOOP: LD    R4, 0(R1)
          DADDI R1, R1, 8
          DMUL  R5, R4, R4
          BNE   R1, R2, LOOP
          SD    R5, -8(R1)

    • unroll the loop once
    • make each iteration do the work of two previous iterations
    • merge the two DADDIs
    • rearrange further to avoid stall cycles

    Unrolled code:
    LOOP: LD    R4, 0(R1)
          DADDI R1, R1, 16
          DMUL  R5, R4, R4
          LD    R4, -8(R1)
          SD    R5, -16(R1)
          DMUL  R5, R4, R4
          BNE   R1, R2, LOOP
          SD    R5, -8(R1)

    • complete elimination of stall cycles
    • the instructions corresponding to the old loop body now take 4 cycles
    • one new iteration takes 8 cycles
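The cycle counts for the three versions of the loop (9, 5 and 8) can be cross-checked with a small model. The following is only a sketch under the assumptions stated in the exercise: LD and DMUL results are available at the end of the Memory access stage, all other results at the end of EX, BNE needs its operands in ID, and every other instruction needs them in EX. The names `READY_AFTER`, `NEEDED_IN` and `cycles_per_iteration` are hypothetical helpers, and loop-carried dependences between iterations are ignored.

```python
# Stage index (0 = IF, 1 = ID, 2 = EX, 3 = MEM) at whose end a result can be forwarded.
READY_AFTER = {"LD": 3, "DMUL": 3}   # assumption: every other producer is ready after EX (2)
# Stage index in which an instruction needs its source operands.
NEEDED_IN = {"BNE": 1}               # assumption: branches compare in ID, all others use EX (2)


def cycles_per_iteration(loop):
    """loop: list of (opcode, destination register or None, source registers)."""
    fetch = []        # fetch cycle of every instruction
    last_writer = {}  # register -> index of its most recent producer in this iteration
    for i, (op, dest, srcs) in enumerate(loop):
        cycle = fetch[-1] + 1 if fetch else 0        # in-order, one instruction per cycle
        for reg in srcs:
            if reg in last_writer:
                j = last_writer[reg]
                ready = fetch[j] + READY_AFTER.get(loop[j][0], 2)
                # Stall until the operand is available in the stage that needs it.
                cycle = max(cycle, ready - NEEDED_IN.get(op, 2) + 1)
        fetch.append(cycle)
        if dest is not None:
            last_writer[dest] = i
    return fetch[-1] + 1


original = [
    ("LD", "R4", ("R1",)), ("DMUL", "R5", ("R4", "R4")), ("SD", None, ("R5", "R1")),
    ("DADDI", "R1", ("R1",)), ("BNE", None, ("R1", "R2")), ("NOP", None, ()),
]
scheduled = [
    ("LD", "R4", ("R1",)), ("DADDI", "R1", ("R1",)), ("DMUL", "R5", ("R4", "R4")),
    ("BNE", None, ("R1", "R2")), ("SD", None, ("R5", "R1")),
]
unrolled = [
    ("LD", "R4", ("R1",)), ("DADDI", "R1", ("R1",)), ("DMUL", "R5", ("R4", "R4")),
    ("LD", "R4", ("R1",)), ("SD", None, ("R5", "R1")), ("DMUL", "R5", ("R4", "R4")),
    ("BNE", None, ("R1", "R2")), ("SD", None, ("R5", "R1")),
]

for name, loop in [("original", original), ("scheduled", scheduled), ("unrolled", unrolled)]:
    print(name, cycles_per_iteration(loop))
```

Under these assumptions the model reports 9, 5 and 8 cycles, which matches the figures on the slides. The unrolled loop handles two array elements per iteration, hence 4 cycles per element.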

  • DAT105: Computer Architecture, Exam 2005-12-12: Assignment 3

    Assume that a floating-point load, division, and multiplication take 10, 50, and 40 cycles, respectively.

    LF   F1, 0(R1)
    DIVF F4, F2, F1
    MULT F2, F1, F0

    Compute the execution time of the above sequence under the following assumptions on a processor that can issue:
    • one instruction per cycle and that has no register renaming capability
    • three instructions per cycle and that has no register renaming capability
    • three instructions per cycle and that has register renaming capability

    Disclaimer: If you feel that more assumptions have to be made, feel free to do so. If they are needed and reasonable, they will be accepted without any deduction on the score.

  • DAT105: Computer Architecture, Exam 2005-12-12: Assignment 3(i)

    Compute execution time on a processor that can issue one instruction per cycle and that has no register renaming capability

    Identify the dependences
    • True data dependence: DIVF and MULT both read F1, which LF writes
    • Name dependence: MULT writes F2, which DIVF reads (antidependence)

    LF   F1, 0(R1)     # 10 cycles
    DIVF F4, F2, F1    # 50 cycles
    MULT F2, F1, F0    # 40 cycles

    Execution Time = 10 + 50 + 40 cycles = 100 cycles

  • DAT105: Computer Architecture, Exam 2005-12-12: Assignment 3(ii)

    Compute execution time on a processor that can issue three instructions per cycle and that has no register renaming capability

    Identify the dependences
    • True data dependence (on F1)
    • Name dependence (on F2)

    LF   F1, 0(R1)     # 10 cycles
    DIVF F4, F2, F1    # 50 cycles
    MULT F2, F1, F0    # 40 cycles

    Without renaming, the name dependence on F2 still forces MULT to wait for DIVF, so the wider issue does not help.

    Execution Time = 10 + 50 + 40 cycles = 100 cycles

  • DAT105: Computer Architecture, Exam 2005-12-12: Assignment 3(iii)

    Compute execution time on a processor that can issue three instructions per cycle and that has register renaming capability

    LF   F1, 0(R1)     # 10 cycles
    DIVF F4, F2, F1    # 50 cycles
    MULT F2, F1, F0    # 40 cycles

    Identify the dependences
    • True data dependence (on F1)
    • Name dependence (on F2)

    Apply register renaming:
    LF   F1, 0(R1)     # 10 cycles
    DIVF F4, F2, F1    # 50 cycles
    MULT S, F1, F0     # 40 cycles

    Cycle  Issued instruction
    1      LF   F1, 0(R1)
    11     DIVF F4, F2, F1
    11     MULT S, F1, F0

    Execution Time = 10 + MAX(50, 40) cycles = 60 cycles
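The three execution times can be reproduced with a small dependence model. This is a sketch only, and it simplifies in the same way the slides do: an instruction starts once every earlier instruction it depends on has finished, issue width and issue-cycle offsets are ignored, and name dependences (WAR, WAW) block only when renaming is unavailable. The function `execution_time` and the tuple layout are hypothetical.

```python
def execution_time(program, renaming):
    """program: list of (destination, sources, latency) in program order."""
    finish = []
    for i, (dest, srcs, latency) in enumerate(program):
        start = 0
        for j in range(i):
            earlier_dest, earlier_srcs, _ = program[j]
            raw = earlier_dest in srcs    # true data dependence
            war = dest in earlier_srcs    # antidependence (name dependence)
            waw = dest == earlier_dest    # output dependence (name dependence)
            if raw or (not renaming and (war or waw)):
                start = max(start, finish[j])
        finish.append(start + latency)
    return max(finish)


program = [
    ("F1", ("R1",), 10),        # LF   F1, 0(R1)
    ("F4", ("F2", "F1"), 50),   # DIVF F4, F2, F1
    ("F2", ("F1", "F0"), 40),   # MULT F2, F1, F0
]

print(execution_time(program, renaming=False))  # 100
print(execution_time(program, renaming=True))   # 60
```

Because the model ignores issue width, it gives 100 cycles for both cases without renaming. That is consistent with the slides, where the antidependence on F2 rather than the issue rate limits the sequence.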

  • DAT105: Computer Architecture, Exam 2005-12-12: Assignment 2(A)

    Evaluating the impact of branch predictions on performance:

    Given:
    • every fifth instruction is a conditional branch
    • all conditional branches can be predicted with 100% accuracy, except for the “branch less than” category that can be predicted with only 50% accuracy
    • CPI = 1 for all instructions including the branches that are correctly predicted
    • misprediction penalty is 10 cycles

    What is the CPI for the integer applications?

  • DAT105: Computer Architecture, Exam 2005-12-12: Assignment 2(A)

    Evaluating the impact of branch predictions on performance:

    Instruction mix:
    • conditional branches: 20% of all instructions
    • “less than”: 35% of the conditional branches for integer applications

    Relative occurrence of mispredicted “less than” branches
    = 0.20 * 0.35 * 0.5 = 0.035

    Misprediction penalty: 10 cycles

    CPI: 1 for the 80% of instructions that are not branches, 1 for the correctly predicted branches

    CPIoverall = 1 * (1 - 0.035) + 10 * 0.035 = 1.315

    The first term covers the correctly predicted branches and all other instructions; the second term covers the mispredicted branches.
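The arithmetic can be checked with a few lines of Python. The sketch follows the formula on the slide, which charges each mispredicted branch 10 cycles in total; the function name `overall_cpi` and its parameter names are hypothetical.

```python
def overall_cpi(branch_fraction, hard_share, mispredict_rate, mispredict_cpi, base_cpi=1.0):
    # Fraction of all instructions that are mispredicted branches.
    mispredicted = branch_fraction * hard_share * mispredict_rate
    # Weighted average over correctly handled instructions and mispredicted branches.
    return base_cpi * (1 - mispredicted) + mispredict_cpi * mispredicted


cpi = overall_cpi(branch_fraction=0.20, hard_share=0.35, mispredict_rate=0.5, mispredict_cpi=10)
print(round(cpi, 3))  # 1.315
```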

  • DAT105: Computer Architecture, Exam 2008-12-18: Assignment 3(A)

    • The following MIPS program operates on an array with 64-bit elements.
    • Initially, the register R1 points to the beginning of the array.
    • The register R2 points to the end.
    • The array always contains 1000 elements.

          ANDI  R3, R3, 0
    LOOP: LD    R4, 0(R1)
          DMUL  R5, R4, R4
          DADD  R5, R3, R5
          SD    R5, 0(R1)
          DADDI R3, R4, 0
          DADDI R1, R1, 8
          BNE   R1, R2, LOOP

    How does a 2-bit branch prediction scheme work for the given code?

  • DAT105: Computer Architecture, Exam 2008-12-18: Assignment 3(A)

    How does a 2-bit branch prediction scheme work for the given code?

    For our example program we would have one entry corresponding to the BNE instruction at the end of the program. The prediction evolves as follows if we assume we start in state “00”:

    Execution of BNE     State before execution   Prediction          State after execution (actual outcome)
    First                00                       Not taken (wrong)   01 (taken)
    Second               01                       Not taken (wrong)   11 (taken)
    3rd, 4th, …, 999th   11                       Taken (correct)     11 (taken)
    1000th               11                       Taken (wrong)       10 (not taken)
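The evolution in the table can be replayed with a short simulation. This is a sketch only: the transitions 00→01, 01→11, 11→11 on a taken branch and 11→10 on a not-taken branch come from the table, while the remaining entries of `NEXT_STATE` are assumptions that follow the usual textbook 2-bit scheme. The high bit of the state is taken to be the prediction.

```python
# (state, branch actually taken?) -> next state
NEXT_STATE = {
    ("00", True): "01", ("00", False): "00",   # not-taken case assumed
    ("01", True): "11", ("01", False): "00",   # not-taken case assumed
    ("10", True): "11", ("10", False): "00",   # both cases assumed
    ("11", True): "11", ("11", False): "10",
}


def simulate(outcomes, state="00"):
    """Return (number of mispredictions, final state) for a sequence of branch outcomes."""
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state[0] == "1"      # high bit gives the prediction
        if predicted_taken != taken:
            mispredictions += 1
        state = NEXT_STATE[(state, taken)]
    return mispredictions, state


# 1000 array elements: BNE is taken 999 times and falls through once.
outcomes = [True] * 999 + [False]
print(simulate(outcomes))  # (3, '10')
```

The three mispredictions are the first two executions and the final loop exit, as in the table.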

  • DAT105: Computer Architecture, Exam 2006-12-22: Assignment 3

    Dynamic scheduling: Tomasulo’s Algorithm

    SUB F0, F1, F2    (8 cycles)
    DIV F3, F0, F4    (10 cycles)
    ADD F4, F5, F6    (6 cycles)

    3(A): Show all data and name dependences in the code

    3(B): Establish when the second addition instruction can start its execution

    Assume that
    • there are two addition functional units and a division functional unit
    • a single instruction is issued every cycle

  • DAT105: Computer Architecture, Exam 2006-12-22: Assignment 3(A)

    Show all data and name dependences in the code

    SUB F0, F1, F2    (8 cycles)
    DIV F3, F0, F4    (10 cycles)
    ADD F4, F5, F6    (6 cycles)

    • True data dependence: DIV reads F0, which SUB writes
    • Name dependence: ADD writes F4, which DIV reads (antidependence)

  • DAT105: Computer Architecture, Exam 2006-12-22: Assignment 3(B)

    Establish when the second addition instruction can start its execution:

    Clock Cycle: 0

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2
    DIV F3, F0, F4
    ADD F4, F5, F6

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  No
    Add2  No
    Div   No

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi

  • DAT105: Computer Architecture, Exam 2006-12-22: Assignment 3(B)

    Establish when the second addition instruction can start its execution:

    Clock Cycle: 1

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes
    DIV F3, F0, F4    No
    ADD F4, F5, F6    No

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  Yes   SUB  Regs[F1]  Regs[F2]
    Add2  No
    Div   No

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     Add1

  • DAT105: Computer Architecture, Exam 2006-12-22: Assignment 3(B)

    Establish when the second addition instruction can start its execution:

    Clock Cycle: 2

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes    Yes
    DIV F3, F0, F4    Yes
    ADD F4, F5, F6    No

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  Yes   SUB  Regs[F1]  Regs[F2]
    Add2  No
    Div   Yes   DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     Add1          Div

  • DAT105: Computer Architecture, Exam 2006-12-22: Assignment 3(B)

    Establish when the second addition instruction can start its execution:

    Clock Cycle: 3

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes    Yes
    DIV F3, F0, F4    Yes
    ADD F4, F5, F6    Yes

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  Yes   SUB  Regs[F1]  Regs[F2]
    Add2  Yes   ADD  Regs[F5]  Regs[F6]
    Div   Yes   DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     Add1          Div  Add2

  • DAT105: Computer Architecture, Exam 2006-12-22: Assignment 3(B)

    Establish when the second addition instruction can start its execution:

    Clock Cycle: 4

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes    Yes
    DIV F3, F0, F4    Yes
    ADD F4, F5, F6    Yes    Yes

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  Yes   SUB  Regs[F1]  Regs[F2]
    Add2  Yes   ADD  Regs[F5]  Regs[F6]
    Div   Yes   DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     Add1          Div  Add2

  • DAT105: Computer Architecture, Exam 2006-12-22: Assignment 3(B)

    Establish when the second addition instruction can start its execution:

    Clock Cycle: 10

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes    Yes      Yes
    DIV F3, F0, F4    Yes
    ADD F4, F5, F6    Yes    Yes      Yes

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  No    SUB  Regs[F1]  Regs[F2]
    Add2  No    ADD  Regs[F5]  Regs[F6]
    Div   Yes   DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     Add1          Div  Add2

  • DAT105: Computer Architecture, Exam 2006-12-22: Assignment 3(B)

    Establish when the second addition instruction can start its execution:

    Clock Cycle: 11

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes    Yes      Yes
    DIV F3, F0, F4    Yes    Yes
    ADD F4, F5, F6    Yes    Yes      Yes

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  No    SUB  Regs[F1]  Regs[F2]
    Add2  No    ADD  Regs[F5]  Regs[F6]
    Div   Yes   DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     -             Div  -

  • DAT105: Computer Architecture, Exam 2006-12-22: Assignment 3(B)

    Establish when the second addition instruction can start its execution:

    Clock Cycle: 21

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes    Yes      Yes
    DIV F3, F0, F4    Yes    Yes      Yes
    ADD F4, F5, F6    Yes    Yes      Yes

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  No    SUB  Regs[F1]  Regs[F2]
    Add2  No    ADD  Regs[F5]  Regs[F6]
    Div   No    DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     -             Div  -
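The clock cycles shown in the walkthrough (ADD starts executing in cycle 4, SUB and ADD write their results in cycle 10, DIV executes from cycle 11 and writes in cycle 21) can be derived with a simplified timing model. This is a sketch, not a full Tomasulo simulator, and it assumes the following: one instruction issues per cycle in program order, execution starts in the cycle after both operands are available, a result is written in the cycle after execution ends, and reservation stations and the common data bus are never a bottleneck. WAR and WAW hazards impose no wait because operands are captured at issue. The function `tomasulo_timeline` is a hypothetical helper.

```python
def tomasulo_timeline(program):
    """program: list of (name, destination, sources, latency) in program order."""
    producer = {}   # register -> index of the latest instruction that writes it
    timeline = []
    for i, (name, dest, srcs, latency) in enumerate(program):
        issue = i + 1                          # a single instruction is issued every cycle
        ready = issue                          # operands read from the register file at issue
        for reg in srcs:
            if reg in producer:                # RAW: wait for the producer's result broadcast
                ready = max(ready, timeline[producer[reg]]["write"])
        start = ready + 1                      # first execution cycle
        write = start + latency                # cycle after the last execution cycle
        timeline.append({"name": name, "issue": issue, "start": start, "write": write})
        producer[dest] = i
    return timeline


program = [
    ("SUB", "F0", ("F1", "F2"), 8),
    ("DIV", "F3", ("F0", "F4"), 10),
    ("ADD", "F4", ("F5", "F6"), 6),
]

timeline = tomasulo_timeline(program)
for row in timeline:
    print(row["name"], "issue", row["issue"], "start", row["start"], "write", row["write"])
```

In this model the ADD does not wait for the DIV even though it overwrites F4, because the DIV already captured Regs[F4] in its reservation station when it issued in cycle 2.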
