Computer Architecture Elementary Pipelining Study


Computers are essentially very fast idiots; it therefore becomes necessary to understand their idiosyncrasies and to augment their shortcomings with human intelligence.


Jordan Daniel Ulmer, Computer Org. HW#5, CH(6)
Figure credit: Computer Organization and Embedded Systems, Hamacher, Vranesic, Zaky, Manjikian, 6th ed., McGraw-Hill, 2012
https://sites.google.com/site/whoisjdu/

    6.1


    6.1 [M] Consider the following instructions at the given addresses in the memory:

    1000 Add R3, R2, #20

    1004 Subtract R5, R4, #3

    1008 And R6, R4, #0x3A

    1012 Add R7, R2, R4

Initially, [R2] = 2000 and [R4] = 50. These instructions are executed in a computer that has a five-stage pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1, and the remaining instructions are fetched in successive cycles.


(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1 through 8.

    --

(6.1.a) Flow of instructions. *Assuming a positive-edge clock, where all operations are latched at the END of the cycle.

Operations per cycle (cycles 1 through 8):

1000 Add R3, R2, #20
  Cycle 1  Fetch:   IR <- (1000)
  Cycle 2  Decode:  RA <- [R2], RB <- [R3]
  Cycle 3  Compute: RZ <- [R2] + #20
  Cycle 4  Memory:  RY <- [RZ]
  Cycle 5  Write:   R3 <- [RY]

1004 Subtract R5, R4, #3
  Cycle 2  Fetch:   IR <- (1004)
  Cycle 3  Decode:  RA <- [R4], RB <- [R5]
  Cycle 4  Compute: RZ <- [R4] - #3
  Cycle 5  Memory:  RY <- [RZ]
  Cycle 6  Write:   R5 <- [RY]

1008 And R6, R4, #0x3A
  Cycle 3  Fetch:   IR <- (1008)
  Cycle 4  Decode:  RA <- [R4], RB <- [R6]
  Cycle 5  Compute: RZ <- [R4] & #0x3A
  Cycle 6  Memory:  RY <- [RZ]
  Cycle 7  Write:   R6 <- [RY]

1012 Add R7, R2, R4
  Cycle 4  Fetch:   IR <- (1012)
  Cycle 5  Decode:  RA <- [R2], RB <- [R4]
  Cycle 6  Compute: RZ <- [R2] + [R4]
  Cycle 7  Memory:  RY <- [RZ]
  Cycle 8  Write:   R7 <- [RY]

(b) With reference to Figures 5.8 and 5.9, describe the contents of registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ in the pipeline during cycles 2 to 8.

    --


(6.1.b) Contents of registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ:

Cycle  | 1         | 2              | 3         | 4         | 5         | 6    | 7    | 8    | 9
[PC]   | 1000      | 1004           | 1008      | 1012      | ?         | ?    | ?    | ?    | ?
[IR]   | Add(1000) | Subtract(1004) | And(1008) | Add(1012) | ?         | ?    | ?    | ?    | ?
[RA]   | ?         | [R2]=2000      | [R4]=50   | [R4]=50   | [R2]=2000 | ?    | ?    | ?    | ?
[RB]   | ?         | [R3]=?         | [R5]=?    | [R6]=?    | [R4]=50   | ?    | ?    | ?    | ?
[RZ]   | ?         | ?              | ?         | 2020      | 47        | 50   | 2050 | ?    | ?
[RY]   | ?         | ?              | ?         | ?         | 2020      | 47   | 50   | 2050 | ?
[R2]   | 2000      | 2000           | 2000      | 2000      | 2000      | 2000 | 2000 | 2000 | 2000
[R3]   | ?         | ?              | ?         | ?         | ?         | 2020 | 2020 | 2020 | 2020
[R4]   | 50        | 50             | 50        | 50        | 50        | 50   | 50   | 50   | 50
[R5]   | ?         | ?              | ?         | ?         | ?         | ?    | 47   | 47   | 47
[R6]   | ?         | ?              | ?         | ?         | ?         | ?    | ?    | 50   | 50
[R7]   | ?         | ?              | ?         | ?         | ?         | ?    | ?    | ?    | 2050

RZ/RY values: 2020 = [R2]+#20, 47 = [R4]-#3, 50 = [R4]&#0x3A, 2050 = [R2]+[R4].
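As a cross-check of the table above, here is a minimal Python sketch that steps the five-stage pipeline cycle by cycle and reproduces the final register values. It is not the textbook's register-transfer model: it only tracks the Compute and Write stages, assumes one instruction enters Fetch per cycle with results latched at the end of a cycle, and relies on this particular program having no data hazards.

```python
# Minimal sketch of the five-stage pipeline used above (Fetch, Decode,
# Compute, Memory, Write).  Every stage takes one cycle; results are
# latched at the end of the cycle.  Program and initial values are those
# of Problem 6.1.

program = [
    (1000, "Add",      "R3", "R2", 20),     # R3 <- [R2] + #20
    (1004, "Subtract", "R5", "R4", 3),      # R5 <- [R4] - #3
    (1008, "And",      "R6", "R4", 0x3A),   # R6 <- [R4] & #0x3A
    (1012, "Add",      "R7", "R2", "R4"),   # R7 <- [R2] + [R4]
]

ops = {"Add": lambda a, b: a + b,
       "Subtract": lambda a, b: a - b,
       "And": lambda a, b: a & b}

regs = {f"R{i}": None for i in range(8)}
regs["R2"], regs["R4"] = 2000, 50            # initial values from the problem

rz = {}                                      # RZ value computed for each instruction

for cycle in range(1, 9):
    for slot, (addr, op, rd, rs, src2) in enumerate(program):
        stage = cycle - slot                 # 1 Fetch, 2 Decode, 3 Compute, 4 Memory, 5 Write
        if stage == 3:                       # Compute: RZ <- [rs] op src2
            b = regs[src2] if isinstance(src2, str) else src2
            rz[addr] = ops[op](regs[rs], b)
        elif stage == 5:                     # Write: rd <- RY (RY just carries RZ through Memory)
            regs[rd] = rz[addr]
    print(f"end of cycle {cycle}: {regs}")

# After cycle 8: R3 = 2020, R5 = 47, R6 = 50, R7 = 2050, matching the table.
```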


    6.2

    6.2 [M] Repeat Problem 6.1 for the following program:

    1000 Add R3, R2, #20

    1004 Subtract R5, R4, #3

    1008 And R6, R3, #0x3A // Data Dependency R3!!!

    1012 Add R7, R2, R4

Assume that the pipeline provides forwarding paths to the ALU from registers RY and RZ in Figure 5.8 and that the processor uses forwarding of operands.

    --

Initially, [R2] = 2000 and [R4] = 50. These instructions are executed in a computer that has a five-stage pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1, and the remaining instructions are fetched in successive cycles.


(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1 through 8.

    --

(6.2.a) Flow of instructions. *Assuming a positive-edge clock, where all operations are latched at the END of the cycle.

Operations per cycle (cycles 1 through 8):

1000 Add R3, R2, #20
  Cycle 1  Fetch:   IR <- (1000)
  Cycle 2  Decode:  RA <- [R2], RB <- [R3]
  Cycle 3  Compute: RZ <- [R2] + #20
  Cycle 4  Memory:  RY <- [RZ]
  Cycle 5  Write:   R3 <- [RY]

1004 Subtract R5, R4, #3
  Cycle 2  Fetch:   IR <- (1004)
  Cycle 3  Decode:  RA <- [R4], RB <- [R5]
  Cycle 4  Compute: RZ <- [R4] - #3
  Cycle 5  Memory:  RY <- [RZ]
  Cycle 6  Write:   R5 <- [RY]

1008 And R6, R3, #0x3A   // data dependency on R3
  Cycle 3  Fetch:   IR <- (1008)
  Cycle 4  Decode:  RA <- [RZ] (forwarded new value of R3), RB <- [R6]
  Cycle 5  Compute: RZ <- [R3] & #0x3A
  Cycle 6  Memory:  RY <- [RZ]
  Cycle 7  Write:   R6 <- [RY]

1012 Add R7, R2, R4
  Cycle 4  Fetch:   IR <- (1012)
  Cycle 5  Decode:  RA <- [R2], RB <- [R4]
  Cycle 6  Compute: RZ <- [R2] + [R4]
  Cycle 7  Memory:  RY <- [RZ]
  Cycle 8  Write:   R7 <- [RY]
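The forwarding assumed here can be captured by a small selection check at the ALU input: if an older in-flight instruction is about to write the register we need, take its value from RZ or RY instead of the register file. The sketch below is illustrative only (the InFlight record and the function name are not from the textbook), and it performs the check at the ALU in the Compute cycle, whereas the table above shows the forwarded value already captured at Decode; the selection logic is the same idea.

```python
# Sketch of the operand-forwarding check assumed in Problem 6.2: the ALU
# input is taken from RZ or RY when an older instruction still in the
# pipeline is about to write the register this instruction needs.

from dataclasses import dataclass
from typing import Optional

@dataclass
class InFlight:
    dest: str        # destination register of the older instruction
    value: int       # its computed result (held in RZ or RY)

def alu_operand(src_reg: str, reg_file: dict,
                in_memory: Optional[InFlight],   # instruction one stage ahead (result in RZ)
                in_write: Optional[InFlight]) -> int:
    """Pick the ALU operand: forward from RZ first, then RY, else read the register file."""
    if in_memory is not None and in_memory.dest == src_reg:
        return in_memory.value                   # forwarded from RZ
    if in_write is not None and in_write.dest == src_reg:
        return in_write.value                    # forwarded from RY
    return reg_file[src_reg]                     # no hazard: normal register read

# Cycle 5 of Problem 6.2: "And R6, R3, #0x3A" computes while "Add R3, R2, #20"
# has already produced R3 = 2020 (now in RY) but has not yet written it back,
# and "Subtract R5, R4, #3" holds 47 in RZ.
reg_file = {"R2": 2000, "R3": None, "R4": 50}
forwarded = alu_operand("R3", reg_file,
                        in_memory=InFlight(dest="R5", value=47),
                        in_write=InFlight(dest="R3", value=2020))
print(forwarded & 0x3A)   # 2020 & 0x3A = 32, matching the table
```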

(b) With reference to Figures 5.8 and 5.9, describe the contents of R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ in the pipeline during cycles 2 to 8.

    --


    (6.2.b) Contents of Registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ :

Cycle  | 1         | 2              | 3         | 4              | 5         | 6    | 7    | 8    | 9
[PC]   | 1000      | 1004           | 1008      | 1012           | ?         | ?    | ?    | ?    | ?
[IR]   | Add(1000) | Subtract(1004) | And(1008) | Add(1012)      | ?         | ?    | ?    | ?    | ?
[RA]   | ?         | [R2]=2000      | [R4]=50   | 2020 (fwd R3)  | [R2]=2000 | ?    | ?    | ?    | ?
[RB]   | ?         | [R3]=?         | [R5]=?    | [R6]=?         | [R4]=50   | ?    | ?    | ?    | ?
[RZ]   | ?         | ?              | ?         | 2020           | 47        | 32   | 2050 | ?    | ?
[RY]   | ?         | ?              | ?         | ?              | 2020      | 47   | 32   | 2050 | ?
[R2]   | 2000      | 2000           | 2000      | 2000           | 2000      | 2000 | 2000 | 2000 | 2000
[R3]   | ?         | ?              | ?         | ?              | ?         | 2020 | 2020 | 2020 | 2020
[R4]   | 50        | 50             | 50        | 50             | 50        | 50   | 50   | 50   | 50
[R5]   | ?         | ?              | ?         | ?              | ?         | ?    | 47   | 47   | 47
[R6]   | ?         | ?              | ?         | ?              | ?         | ?    | ?    | 32   | 32
[R7]   | ?         | ?              | ?         | ?              | ?         | ?    | ?    | ?    | 2050

RZ/RY values: 2020 = [R2]+#20, 47 = [R4]-#3, 32 = [R3]&#0x3A, 2050 = [R2]+[R4]. In cycle 4, RA receives the new value of R3 (2020), forwarded from RZ.


    6.7

6.7 [M] Assume that 20 percent of the dynamic count of the instructions executed for a program are branch instructions. Delayed branching is used, with one delay slot. Assume that there are no stalls caused by other factors.

(a.) First, derive an expression for the execution time in cycles if all delay slots are filled with NOP instructions.

Branches: 20% of instructions; delay slots: 100% NOPs.

    --
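A sketch of such an expression, assuming a dynamic count of N instructions and one cycle per instruction apart from the branch-delay NOPs:

\[
T_{\text{NOP}} = N\,(1 + 0.2 \times 1) = 1.2N \ \text{cycles}
\]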

(b.) Then, derive another expression that reflects the execution time with 70 percent of delay slots filled with useful instructions by the optimizing compiler.

Branches: 20% of instructions; delay slots: 30% NOPs, 70% useful operations.

    --
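Under the same assumptions, only the 30 percent of delay slots still holding NOPs cost an extra cycle:

\[
T_{\text{opt}} = N\,(1 + 0.2 \times 0.3) = 1.06N \ \text{cycles}
\]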

(c.) From these expressions, determine the increase in performance, expressed as a speedup percentage.

    --
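Taking the ratio of the two sketched expressions above:

\[
\frac{T_{\text{NOP}}}{T_{\text{opt}}} = \frac{1.2N}{1.06N} \approx 1.13 ,
\]

i.e. a speedup of roughly 13 percent under these assumptions.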

The Optimizing Compiler with a 1-branch-delay processor makes operations faster than the same processor without an Optimizing Compiler.


    6.8

6.8 [D] Repeat Problem 6.7, but this time for a pipelined processor with two branch delay slots. The output from the optimizing compiler is such that the first delay slot is filled with a useful instruction 70 percent of the time, but the second slot is filled with a useful instruction only 10 percent of the time. Compare the compiler-optimized execution time for this case with the compiler-optimized execution time for Problem 6.7. Assume that the two processors have the same clock rate. Indicate which processor/compiler combination is faster, and determine the speedup percentage by which it is faster.

    --

(a.) The execution time in cycles if all delay slots are filled with NOP instructions.

(b.) The execution time in cycles if the delay slots are filled as described above.

Branches: 20% of operations.
Slot #1: 30% of delay slots hold a NOP, 70% hold a useful operation.
Slot #2: 90% of delay slots hold a NOP, 10% hold a useful operation.
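Following the same sketch as in Problem 6.7, now with two delay slots per branch and a dynamic count of N instructions:

\[
T_{\text{NOP}} = N\,(1 + 0.2 \times 2) = 1.4N , \qquad
T_{\text{opt}} = N\,\bigl(1 + 0.2 \times (0.3 + 0.9)\bigr) = 1.24N \ \text{cycles}
\]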


(c.) From these expressions, determine the increase in performance, expressed as a speedup percentage.

    --
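Continuing the sketch from the expressions above:

\[
\frac{T_{\text{NOP}}}{T_{\text{opt}}} = \frac{1.4N}{1.24N} \approx 1.13 , \qquad
\frac{T_{\text{opt, 2 slots}}}{T_{\text{opt, 1 slot}}} = \frac{1.24N}{1.06N} \approx 1.17 ,
\]

so, under these assumptions, the compiler buys this processor roughly a 13 percent speedup, while the one-delay-slot processor/compiler combination of Problem 6.7 remains roughly 17 percent faster overall.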

The pipeline with the optimizing compiler is best: an Optimizing Compiler for a 2-branch-delay processor makes execution faster than the same processor without an Optimizing Compiler.


    6.14

6.14 [E] Assume that a program contains no branch instructions. It is executed on the superscalar processor shown in Figure 6.13.

(a.) What is the best execution time in cycles that can be expected if the mix of instructions consists of 75 percent arithmetic instructions and 25 percent memory-access instructions?

Arithmetic (75%) + Memory Access (25%)

    --

Given the assumptions marked on Figure 6.13 below, the arithmetic path handles three quarters of the instructions and the load/store path one quarter. Because there is a significantly unequal number of instructions passed through each pipeline, we can safely assume that the Load/Store stream (1/4 of the total instructions, the MINORITY) will be able to finish before the Arithmetic stream (3/4 of the total instructions, the MAJORITY) has drained from the pipeline, so the arithmetic path sets the execution time.

[Figure 6.13, annotated (assumptions in yellow): each pipeline stage takes 1 cycle, except the memory-access stage, which takes 2 cycles.]
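A sketch of the resulting cycle count, assuming N instructions in total, that each execution unit accepts one instruction per cycle in steady state, and ignoring the handful of pipeline-fill cycles:

\[
T_{6.13} \approx \max(0.75N,\ 0.25N) = 0.75N \ \text{cycles}
\]

Even if the 2-cycle memory stage of the load/store unit cannot be overlapped, that stream needs only about 0.5N cycles, so the arithmetic path still dominates.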


(b.) How does this time compare to the best execution time on the simpler processor in Figure 6.2 using the same clock?

    --

Given the assumptions marked on Figure 6.2 below, the simple pipeline will be limited by the memory access:

[Figure 6.2, annotated (assumptions in yellow): each pipeline stage takes 1 cycle, except the memory-access stage, which takes 2 cycles.]


Thus our comparison now becomes the execution time of the superscalar processor versus the execution time of the simple pipeline. Because both processors would be dealing with the same set of instructions and the same clock, the comparison simplifies to a ratio of cycle counts:
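One way to make that comparison concrete, under the same assumptions as above: the superscalar of Figure 6.13 needs about 0.75N cycles, while the simple pipeline needs about N cycles if its 2-cycle memory stage overlaps with the following instruction, or about 1.25N cycles if every memory access costs one extra cycle, giving

\[
\frac{T_{6.2}}{T_{6.13}} \approx \frac{N \ \text{to} \ 1.25N}{0.75N} \approx 1.3 \ \text{to} \ 1.7 ,
\]

i.e. roughly one to two cycles saved for every four instructions.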

So we saved a little more than one cycle by re-routing memory accesses to a separate pipeline; we could have saved even more cycles by having a more balanced set of instructions (e.g., Arithmetic (50%); Memory Access (50%)).

And so we find that our Double-Path Super-Scalar Processor is faster than a Single-Path Pipelined Processor.


    6.15

    HAS BRANCHES

6.15 [M] Repeat Problem 6.14 to find the best possible execution times for the processors in Figures 6.2 and 6.13, assuming that the mix of instructions consists of 15 percent branch instructions that are never taken, 65 percent arithmetic instructions, and 20 percent memory-access instructions. Assume a prediction accuracy of 100 percent for all branch instructions (so the branch delay is minimized).

    --

(a.) What is the best execution time in cycles that can be expected if the mix of instructions consists of Arithmetic (65%) + Memory Access (20%) + Branch Never Taken (15%) instructions?

    Arithmetic (65%) + Memory Access (20%) + Branch Never Taken (15%)

    --

Making the same assumptions as in Problem 6.14 (marked on Figure 6.13): because the number of Arithmetic instructions outweighs the number of Memory Accesses, the Arithmetic path again determines the execution time. We now also have the additional possibility of a branch delay; HOWEVER, THIS IS TAKEN CARE OF by the fetch unit, as a precursor to the execution stage.
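A sketch under the same assumptions as in Problem 6.14, with branch instructions resolved entirely in the fetch unit so that they never occupy an execution slot:

\[
T_{6.13} \approx \max(0.65N,\ 0.20N) = 0.65N \ \text{cycles}
\]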

    So, in this scenario, our best Execution Time will be the SAME regardless of branching operations.


(b.) How does this time compare to the best execution time on the simpler processor in Figure 6.2 using the same clock?

    --

Making the same assumptions as in Problem 6.14 (marked on Figure 6.2): we now have the additional possibility of a single-cycle branch delay. Thus our comparison becomes the execution time of the superscalar processor versus the execution time of the simple pipeline; because both processors would be dealing with the same set of instructions and the same clock, the comparison simplifies to a ratio of cycle counts. So we can see that the simple pipeline takes longer to reconcile the branch delays, but the superscalar pipeline is essentially unaffected by branching considerations (because this is taken care of in the fetch stage).

Furthermore, looking at the resulting comparison:
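One way to make that comparison concrete, assuming (as the text above suggests) that each branch costs the simple pipeline one extra cycle while the superscalar absorbs branches in its fetch unit:

\[
T_{6.2} \approx N\,(1 + 0.15) = 1.15N \ \text{cycles (plus any memory-stall cycles, as in 6.14)}, \qquad
T_{6.13} \approx 0.65N ,
\]

so the superscalar's advantage grows once branching is taken into account.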

And so we find that our Double-Path Super-Scalar Processor is faster than a Single-Path Pipelined Processor when we also take branching effects into consideration.


Supplementary (a, b, c)


KEY for the tables of operations:

Fetch = F
Decode = D
Execute (using processor hardware) = E
Execute with register renaming (not using processor hardware) = E
Write back = W
Waiting on = X

NOTE: This method of display was developed in collaboration with Nathan Genetzky.

(Supplementary.A.) Completed table of operations (NOTE: horizontal = temporal axis):

    F D E W

    F D R3 E E E E E W

    F D ONE E W

    F D R6 R6 R6 R6 R6 E W

    F D E W

    F D R7 E W

    F D FIVE E W

    F D R1 R1 R1 E W

    F D E E E E E W

    F D R6 R6 R6 R6 R6 E W

    F D SEVEN E W

    R3 R7 R7 R6,R2 R5 R1 R0 R3 R6 R2

OOPS, ERROR: instruction 8 cannot use the ALU at the same time as instruction 3.


(B.2) What is the savings from using out-of-order capability?

    --

We save SEVEN cycles: 11 operations complete in 17 cycles (using out-of-order capability) instead of 24 cycles (not using out-of-order capability). This is a really significant saving!

(Supplementary.B.) Completed table of operations, with NO out-of-order capability and NO register renaming (NOTE: horizontal = temporal axis):

    F D E W

    F D R3 E E E E E W

    F D ONE ONE ONE ONE ONE E W

    F D R6 R6 R6 R6 R6 E W

    F D THREE THREE THREE THREE E W

    F D R7 R7 R7 R7 R7 R7 E W

    F D FIVE FIVE FIVE FIVE FIVE FIVE E W

    F D R1 R1 R1 SIX SIX SIX E W

    F D SEVEN SEVEN SEVEN SEVEN SEVEN SEVEN E E E E E W

    F D R6 R6 R6 R6 R6 R6 R6 R6 R6 R6 R6 E W

    F D NINE NINE NINE NINE NINE NINE NINE NINE NINE NINE NINE E W

    R3 R6 R7 R1 R7 R2 R5 R0 R6 R2 R3
