Computer Architecture Elementary Pipelining Study


Computers are essentially very fast idiots; it therefore becomes necessary to understand their idiosyncrasies and to augment their shortcomings with human intelligence.


Jordan Daniel Ulmer, Computer Org. HW#5, CH(6)
Figure credit: Computer Organization and Embedded Systems, Hamacher, Vranesic, Zaky, Manjikian, 6th ed., McGraw-Hill, 2012
https://sites.google.com/site/whoisjdu/

    6.1


    6.1 [M] Consider the following instructions at the given addresses in the memory:

    1000 Add R3, R2, #20

    1004 Subtract R5, R4, #3

    1008 And R6, R4, #0x3A

    1012 Add R7, R2, R4

Initially, [R2] = 2000 and [R4] = 50. These instructions are executed in a computer that has a five-stage pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1, and the remaining instructions are fetched in successive cycles.


(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1 through 8.

    --

(6.1.a) Flow of instructions. *Assuming a positive-edge clock, where all operations are latched at the END of the cycle.

Operations per cycle (cycles 1 through 8):

1000 Add R3, R2, #20
  Cycle 1  Fetch:   IR <- (1000)
  Cycle 2  Decode:  RA <- [R2], RB <- [R3]
  Cycle 3  Compute: RZ <- [R2] + #20
  Cycle 4  Memory:  RY <- [RZ]
  Cycle 5  Write:   R3 <- [RY]

1004 Subtract R5, R4, #3
  Cycle 2  Fetch:   IR <- (1004)
  Cycle 3  Decode:  RA <- [R4], RB <- [R5]
  Cycle 4  Compute: RZ <- [R4] - #3
  Cycle 5  Memory:  RY <- [RZ]
  Cycle 6  Write:   R5 <- [RY]

1008 And R6, R4, #0x3A
  Cycle 3  Fetch:   IR <- (1008)
  Cycle 4  Decode:  RA <- [R4], RB <- [R6]
  Cycle 5  Compute: RZ <- [R4] & #0x3A
  Cycle 6  Memory:  RY <- [RZ]
  Cycle 7  Write:   R6 <- [RY]

1012 Add R7, R2, R4
  Cycle 4  Fetch:   IR <- (1012)
  Cycle 5  Decode:  RA <- [R2], RB <- [R4]
  Cycle 6  Compute: RZ <- [R2] + [R4]
  Cycle 7  Memory:  RY <- [RZ]
  Cycle 8  Write:   R7 <- [RY]

(b) With reference to Figures 5.8 and 5.9, describe the contents of registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ in the pipeline during cycles 2 to 8.

    --


(6.1.b) Contents of registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ:

Cycle  | 1         | 2              | 3         | 4         | 5         | 6    | 7    | 8    | 9
[PC]   | 1000      | 1004           | 1008      | 1012      | ?         | ?    | ?    | ?    | ?
[IR]   | Add(1000) | Subtract(1004) | And(1008) | Add(1012) | ?         | ?    | ?    | ?    | ?
[RA]   | ?         | [R2]=2000      | [R4]=50   | [R4]=50   | [R2]=2000 | ?    | ?    | ?    | ?
[RB]   | ?         | [R3]=?         | [R5]=?    | [R6]=?    | [R4]=50   | ?    | ?    | ?    | ?
[RZ]   | ?         | ?              | ?         | 2020      | 47        | 50   | 2050 | ?    | ?
[RY]   | ?         | ?              | ?         | ?         | 2020      | 47   | 50   | 2050 | ?
[R2]   | 2000      | 2000           | 2000      | 2000      | 2000      | 2000 | 2000 | 2000 | 2000
[R3]   | ?         | ?              | ?         | ?         | ?         | 2020 | 2020 | 2020 | 2020
[R4]   | 50        | 50             | 50        | 50        | 50        | 50   | 50   | 50   | 50
[R5]   | ?         | ?              | ?         | ?         | ?         | ?    | 47   | 47   | 47
[R6]   | ?         | ?              | ?         | ?         | ?         | ?    | ?    | 50   | 50
[R7]   | ?         | ?              | ?         | ?         | ?         | ?    | ?    | ?    | 2050

RZ/RY values: 2020 = [R2]+#20, 47 = [R4]-#3, 50 = [R4]&#0x3A, 2050 = [R2]+[R4].
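As a cross-check of the table above, here is a minimal Python sketch that steps the five-stage pipeline cycle by cycle and reproduces the final register values. It is not the textbook's register-transfer model: it only tracks the Compute and Write stages, assumes one instruction enters Fetch per cycle with results latched at the end of a cycle, and relies on this particular program having no data hazards.

```python
# Minimal sketch of the five-stage pipeline used above (Fetch, Decode,
# Compute, Memory, Write).  Every stage takes one cycle; results are
# latched at the end of the cycle.  Program and initial values are those
# of Problem 6.1.

program = [
    (1000, "Add",      "R3", "R2", 20),     # R3 <- [R2] + #20
    (1004, "Subtract", "R5", "R4", 3),      # R5 <- [R4] - #3
    (1008, "And",      "R6", "R4", 0x3A),   # R6 <- [R4] & #0x3A
    (1012, "Add",      "R7", "R2", "R4"),   # R7 <- [R2] + [R4]
]

ops = {"Add": lambda a, b: a + b,
       "Subtract": lambda a, b: a - b,
       "And": lambda a, b: a & b}

regs = {f"R{i}": None for i in range(8)}
regs["R2"], regs["R4"] = 2000, 50            # initial values from the problem

rz = {}                                      # RZ value computed for each instruction

for cycle in range(1, 9):
    for slot, (addr, op, rd, rs, src2) in enumerate(program):
        stage = cycle - slot                 # 1 Fetch, 2 Decode, 3 Compute, 4 Memory, 5 Write
        if stage == 3:                       # Compute: RZ <- [rs] op src2
            b = regs[src2] if isinstance(src2, str) else src2
            rz[addr] = ops[op](regs[rs], b)
        elif stage == 5:                     # Write: rd <- RY (RY just carries RZ through Memory)
            regs[rd] = rz[addr]
    print(f"end of cycle {cycle}: {regs}")

# After cycle 8: R3 = 2020, R5 = 47, R6 = 50, R7 = 2050, matching the table.
```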


    6.2

    6.2 [M] Repeat Problem 6.1 for the following program:

    1000 Add R3, R2, #20

    1004 Subtract R5, R4, #3

    1008 And R6, R3, #0x3A // Data Dependency R3!!!

    1012 Add R7, R2, R4

Assume that the pipeline provides forwarding paths to the ALU from registers RY and RZ in Figure 5.8 and that the processor uses forwarding of operands.

    --

Initially, [R2] = 2000 and [R4] = 50. These instructions are executed in a computer that has a five-stage pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1, and the remaining instructions are fetched in successive cycles.


(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1 through 8.

    --

(6.2.a) Flow of instructions. *Assuming a positive-edge clock, where all operations are latched at the END of the cycle.

Operations per cycle (cycles 1 through 8):

1000 Add R3, R2, #20
  Cycle 1  Fetch:   IR <- (1000)
  Cycle 2  Decode:  RA <- [R2], RB <- [R3]
  Cycle 3  Compute: RZ <- [R2] + #20
  Cycle 4  Memory:  RY <- [RZ]
  Cycle 5  Write:   R3 <- [RY]

1004 Subtract R5, R4, #3
  Cycle 2  Fetch:   IR <- (1004)
  Cycle 3  Decode:  RA <- [R4], RB <- [R5]
  Cycle 4  Compute: RZ <- [R4] - #3
  Cycle 5  Memory:  RY <- [RZ]
  Cycle 6  Write:   R5 <- [RY]

1008 And R6, R3, #0x3A   // data dependency on R3
  Cycle 3  Fetch:   IR <- (1008)
  Cycle 4  Decode:  RA <- [RZ] (forwarded new value of R3), RB <- [R6]
  Cycle 5  Compute: RZ <- [R3] & #0x3A
  Cycle 6  Memory:  RY <- [RZ]
  Cycle 7  Write:   R6 <- [RY]

1012 Add R7, R2, R4
  Cycle 4  Fetch:   IR <- (1012)
  Cycle 5  Decode:  RA <- [R2], RB <- [R4]
  Cycle 6  Compute: RZ <- [R2] + [R4]
  Cycle 7  Memory:  RY <- [RZ]
  Cycle 8  Write:   R7 <- [RY]
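The forwarding assumed here can be captured by a small selection check at the ALU input: if an older in-flight instruction is about to write the register we need, take its value from RZ or RY instead of the register file. The sketch below is illustrative only (the InFlight record and the function name are not from the textbook), and it performs the check at the ALU in the Compute cycle, whereas the table above shows the forwarded value already captured at Decode; the selection logic is the same idea.

```python
# Sketch of the operand-forwarding check assumed in Problem 6.2: the ALU
# input is taken from RZ or RY when an older instruction still in the
# pipeline is about to write the register this instruction needs.

from dataclasses import dataclass
from typing import Optional

@dataclass
class InFlight:
    dest: str        # destination register of the older instruction
    value: int       # its computed result (held in RZ or RY)

def alu_operand(src_reg: str, reg_file: dict,
                in_memory: Optional[InFlight],   # instruction one stage ahead (result in RZ)
                in_write: Optional[InFlight]) -> int:
    """Pick the ALU operand: forward from RZ first, then RY, else read the register file."""
    if in_memory is not None and in_memory.dest == src_reg:
        return in_memory.value                   # forwarded from RZ
    if in_write is not None and in_write.dest == src_reg:
        return in_write.value                    # forwarded from RY
    return reg_file[src_reg]                     # no hazard: normal register read

# Cycle 5 of Problem 6.2: "And R6, R3, #0x3A" computes while "Add R3, R2, #20"
# has already produced R3 = 2020 (now in RY) but has not yet written it back,
# and "Subtract R5, R4, #3" holds 47 in RZ.
reg_file = {"R2": 2000, "R3": None, "R4": 50}
forwarded = alu_operand("R3", reg_file,
                        in_memory=InFlight(dest="R5", value=47),
                        in_write=InFlight(dest="R3", value=2020))
print(forwarded & 0x3A)   # 2020 & 0x3A = 32, matching the table
```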

(b) With reference to Figures 5.8 and 5.9, describe the contents of R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ in the pipeline during cycles 2 to 8.

    --


    (6.2.b) Contents of Registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ :

Cycle  | 1         | 2              | 3         | 4              | 5         | 6    | 7    | 8    | 9
[PC]   | 1000      | 1004           | 1008      | 1012           | ?         | ?    | ?    | ?    | ?
[IR]   | Add(1000) | Subtract(1004) | And(1008) | Add(1012)      | ?         | ?    | ?    | ?    | ?
[RA]   | ?         | [R2]=2000      | [R4]=50   | 2020 (fwd R3)  | [R2]=2000 | ?    | ?    | ?    | ?
[RB]   | ?         | [R3]=?         | [R5]=?    | [R6]=?         | [R4]=50   | ?    | ?    | ?    | ?
[RZ]   | ?         | ?              | ?         | 2020           | 47        | 32   | 2050 | ?    | ?
[RY]   | ?         | ?              | ?         | ?              | 2020      | 47   | 32   | 2050 | ?
[R2]   | 2000      | 2000           | 2000      | 2000           | 2000      | 2000 | 2000 | 2000 | 2000
[R3]   | ?         | ?              | ?         | ?              | ?         | 2020 | 2020 | 2020 | 2020
[R4]   | 50        | 50             | 50        | 50             | 50        | 50   | 50   | 50   | 50
[R5]   | ?         | ?              | ?         | ?              | ?         | ?    | 47   | 47   | 47
[R6]   | ?         | ?              | ?         | ?              | ?         | ?    | ?    | 32   | 32
[R7]   | ?         | ?              | ?         | ?              | ?         | ?    | ?    | ?    | 2050

RZ/RY values: 2020 = [R2]+#20, 47 = [R4]-#3, 32 = [R3]&#0x3A, 2050 = [R2]+[R4]. In cycle 4, RA receives the new value of R3 (2020), forwarded from RZ.


    6.7

6.7 [M] Assume that 20 percent of the dynamic count of the instructions executed for a program are branch instructions. Delayed branching is used, with one delay slot. Assume that there are no stalls caused by other factors.

(a.) First, derive an expression for the execution time in cycles if all delay slots are filled with NOP instructions.

Branches: 20% of instructions; delay slots: 100% NOPs.

    --
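A sketch of such an expression, assuming a dynamic count of N instructions and one cycle per instruction apart from the branch-delay NOPs:

\[
T_{\text{NOP}} = N\,(1 + 0.2 \times 1) = 1.2N \ \text{cycles}
\]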

(b.) Then, derive another expression that reflects the execution time with 70 percent of delay slots filled with useful instructions by the optimizing compiler.

Branches: 20% of instructions; delay slots: 30% NOPs, 70% useful operations.

    --
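Under the same assumptions, only the 30 percent of delay slots still holding NOPs cost an extra cycle:

\[
T_{\text{opt}} = N\,(1 + 0.2 \times 0.3) = 1.06N \ \text{cycles}
\]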

(c.) From these expressions, determine the increase in performance, expressed as a speedup percentage.

    --
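Taking the ratio of the two sketched expressions above:

\[
\frac{T_{\text{NOP}}}{T_{\text{opt}}} = \frac{1.2N}{1.06N} \approx 1.13 ,
\]

i.e. a speedup of roughly 13 percent under these assumptions.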

The Optimizing Compiler with a 1-branch-delay processor makes operations faster than the same processor without an Optimizing Compiler.


    6.8

6.8 [D] Repeat Problem 6.7, but this time for a pipelined processor with two branch delay slots. The output from the optimizing compiler is such that the first delay slot is filled with a useful instruction 70 percent of the time, but the second slot is filled with a useful instruction only 10 percent of the time. Compare the compiler-optimized execution time for this case with the compiler-optimized execution time for Problem 6.7. Assume that the two processors have the same clock rate. Indicate which processor/compiler combination is faster, and determine the speedup percentage by which it is faster.

    --

(a.) The execution time in cycles if all delay slots are filled with NOP instructions.

(b.) The execution time in cycles if the delay slots are filled as described above.

Branches: 20% of operations.
Slot #1: 30% of delay slots hold a NOP, 70% hold a useful operation.
Slot #2: 90% of delay slots hold a NOP, 10% hold a useful operation.
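Following the same sketch as in Problem 6.7, now with two delay slots per branch and a dynamic count of N instructions:

\[
T_{\text{NOP}} = N\,(1 + 0.2 \times 2) = 1.4N , \qquad
T_{\text{opt}} = N\,\bigl(1 + 0.2 \times (0.3 + 0.9)\bigr) = 1.24N \ \text{cycles}
\]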


(c.) From these expressions, determine the increase in performance, expressed as a speedup percentage.

    --
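Continuing the sketch from the expressions above:

\[
\frac{T_{\text{NOP}}}{T_{\text{opt}}} = \frac{1.4N}{1.24N} \approx 1.13 , \qquad
\frac{T_{\text{opt, 2 slots}}}{T_{\text{opt, 1 slot}}} = \frac{1.24N}{1.06N} \approx 1.17 ,
\]

so, under these assumptions, the compiler buys this processor roughly a 13 percent speedup, while the one-delay-slot processor/compiler combination of Problem 6.7 remains roughly 17 percent faster overall.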

The pipeline with the optimizing compiler is best: an Optimizing Compiler for a 2-branch-delay processor makes execution faster than the same processor without an Optimizing Compiler.


    6.14

6.14 [E] Assume that a program contains no branch instructions. It is executed on the superscalar processor shown in Figure 6.13.

(a.) What is the best execution time in cycles that can be expected if the mix of instructions consists of 75 percent arithmetic instructions and 25 percent memory-access instructions?

Arithmetic (75%) + Memory Access (25%)

    --

Given the assumptions marked on Figure 6.13 below, the arithmetic path handles three quarters of the instructions and the load/store path one quarter. Because there is a significantly unequal number of instructions passed through each pipeline, we can safely assume that the Load/Store stream (1/4 of the total instructions, the MINORITY) will be able to finish before the Arithmetic stream (3/4 of the total instructions, the MAJORITY) has drained from the pipeline, so the arithmetic path sets the execution time.

[Figure 6.13, annotated (assumptions in yellow): each pipeline stage takes 1 cycle, except the memory-access stage, which takes 2 cycles.]
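A sketch of the resulting cycle count, assuming N instructions in total, that each execution unit accepts one instruction per cycle in steady state, and ignoring the handful of pipeline-fill cycles:

\[
T_{6.13} \approx \max(0.75N,\ 0.25N) = 0.75N \ \text{cycles}
\]

Even if the 2-cycle memory stage of the load/store unit cannot be overlapped, that stream needs only about 0.5N cycles, so the arithmetic path still dominates.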


(b.) How does this time compare to the best execution time on the simpler processor in Figure 6.2 using the same clock?

    --

Given the assumptions marked on Figure 6.2 below, the simple pipeline will be limited by the memory access:

[Figure 6.2, annotated (assumptions in yellow): each pipeline stage takes 1 cycle, except the memory-access stage, which takes 2 cycles.]


Thus our comparison now becomes the execution time of the superscalar processor versus the execution time of the simple pipeline. Because both processors would be dealing with the same set of instructions and the same clock, the comparison simplifies to a ratio of cycle counts:
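One way to make that comparison concrete, under the same assumptions as above: the superscalar of Figure 6.13 needs about 0.75N cycles, while the simple pipeline needs about N cycles if its 2-cycle memory stage overlaps with the following instruction, or about 1.25N cycles if every memory access costs one extra cycle, giving

\[
\frac{T_{6.2}}{T_{6.13}} \approx \frac{N \ \text{to} \ 1.25N}{0.75N} \approx 1.3 \ \text{to} \ 1.7 ,
\]

i.e. roughly one to two cycles saved for every four instructions.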

So we saved a little more than one cycle by re-routing memory accesses to a separate pipeline; we could have saved even more cycles by having a more balanced set of instructions (e.g., Arithmetic (50%); Memory Access (50%)).

And so we find that our Double-Path Super-Scalar Processor is faster than a Single-Path Pipelined Processor.


    6.15

    HAS BRANCHES

6.15 [M] Repeat Problem 6.14 to find the best possible execution times for the processors in Figures 6.2 and 6.13, assuming that the mix of instructions consists of 15 percent branch instructions that are never taken, 65 percent arithmetic instructions, and 20 percent memory-access instructions. Assume a prediction accuracy of 100 percent for all branch instructions (so the branch delay is minimized).

    --

(a.) What is the best execution time in cycles that can be expected if the mix of instructions consists of Arithmetic (65%) + Memory Access (20%) + Branch Never Taken (15%) instructions?

    Arithmetic (65%) + Memory Access (20%) + Branch Never Taken (15%)

    --

Making the same assumptions as in Problem 6.14 (marked on Figure 6.13): because the number of Arithmetic instructions outweighs the number of Memory Accesses, the Arithmetic path again determines the execution time. We now also have the additional possibility of a branch delay; HOWEVER, THIS IS TAKEN CARE OF by the fetch unit, as a precursor to the execution stage.
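A sketch under the same assumptions as in Problem 6.14, with branch instructions resolved entirely in the fetch unit so that they never occupy an execution slot:

\[
T_{6.13} \approx \max(0.65N,\ 0.20N) = 0.65N \ \text{cycles}
\]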

    So, in this scenario, our best Execution Time will be the SAME regardless of branching operations.


(b.) How does this time compare to the best execution time on the simpler processor in Figure 6.2 using the same clock?

    --

Making the same assumptions as in Problem 6.14 (marked on Figure 6.2): we now have the additional possibility of a single-cycle branch delay. Thus our comparison becomes the execution time of the superscalar processor versus the execution time of the simple pipeline; because both processors would be dealing with the same set of instructions and the same clock, the comparison simplifies to a ratio of cycle counts. So we can see that the simple pipeline takes longer to reconcile the branch delays, but the superscalar pipeline is essentially unaffected by branching considerations (because this is taken care of in the fetch stage).

Furthermore, looking at the resulting comparison:
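One way to make that comparison concrete, assuming (as the text above suggests) that each branch costs the simple pipeline one extra cycle while the superscalar absorbs branches in its fetch unit:

\[
T_{6.2} \approx N\,(1 + 0.15) = 1.15N \ \text{cycles (plus any memory-stall cycles, as in 6.14)}, \qquad
T_{6.13} \approx 0.65N ,
\]

so the superscalar's advantage grows once branching is taken into account.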

And so we find that our Double-Path Super-Scalar Processor is faster than a Single-Path Pipelined Processor when we also take branching effects into consideration.


Supplementary (a, b, c)


KEY for the tables of operations:

Fetch = F
Decode = D
Execute (using processor hardware) = E
Execute with register renaming (not using processor hardware) = E
Write back = W
Waiting on = X

NOTE: This method of display was developed in collaboration with Nathan Genetzky.

(Supplementary.A.) Completed table of operations (NOTE: horizontal = temporal axis):

    F D E W

    F D R3 E E E E E W

    F D ONE E W

    F D R6 R6 R6 R6 R6 E W

    F D E W

    F D R7 E W

    F D FIVE E W

    F D R1 R1 R1 E W

    F D E E E E E W

    F D R6 R6 R6 R6 R6 E W

    F D SEVEN E W

    R3 R7 R7 R6,R2 R5 R1 R0 R3 R6 R2

OOPS, ERROR: instruction 8 cannot use the ALU at the same time as instruction 3.


(B.2) What is the savings from using out-of-order capability?

    --

We save SEVEN cycles: 11 operations complete in 17 cycles (using out-of-order capability) instead of 24 cycles (not using out-of-order capability). This is a really significant saving!

(Supplementary.B.) Completed table of operations, with NO out-of-order capability and NO register renaming (NOTE: horizontal = temporal axis):

    F D E W

    F D R3 E E E E E W

    F D ONE ONE ONE ONE ONE E W

    F D R6 R6 R6 R6 R6 E W

    F D THREE THREE THREE THREE E W

    F D R7 R7 R7 R7 R7 R7 E W

    F D FIVE FIVE FIVE FIVE FIVE FIVE E W

    F D R1 R1 R1 SIX SIX SIX E W

    F D SEVEN SEVEN SEVEN SEVEN SEVEN SEVEN E E E E E W

    F D R6 R6 R6 R6 R6 R6 R6 R6 R6 R6 R6 E W

    F D NINE NINE NINE NINE NINE NINE NINE NINE NINE NINE NINE E W

    R3 R6 R7 R1 R7 R2 R5 R0 R6 R2 R3
