Computer Architecture and Organization
CS 2214
Haldun Hadimioglu
Computer Science & Engineering
Pipelined EMY CPU
Version 1
Spring 2014
Haldun Hadimioglu CSE – Spring
2014
Slide 2: Outline
- Introduction
- Version 1 EMY CPU: the pipelined EMY CPU
  - It executes only integer instructions
  - How a memory hierarchy can be attached to the pipelined EMY CPU is also studied
  - Version 0, the unpipelined EMY CPU, is described in another presentation
- Handout to use the pipelined EMY CPU
Slide 3: Introduction
- On the microarchitecture layer, a computer is a collection of at least three interconnected digital systems
  - A central processing unit (CPU)
  - A (main) memory
  - An I/O controller to control an I/O device, such as the disk
    - There can be several I/O controllers to control several different I/O devices
[Figure: the CPU, the memory and an I/O controller (with its disk) attached to an interconnection system]
Slide 4: Digital Systems
- A digital system performs microoperations
- It consists of a datapath (data unit) and a control unit
  - The datapath actually performs the microoperations
  - The control unit determines which microoperation happens when
[Figure: the datapath (registers, ALUs, buses) sends status signals to the control unit (sequencer), which sends control signals to the datapath]
Slide 5: Digital Systems
- The datapath (data unit) has registers, ALUs and buses to perform the microoperations
  - Registers keep information temporarily
  - ALUs perform arithmetic/logic operations
  - Buses interconnect the registers and ALUs
  - Other components used include
    - Multiplexers (MUXes), decoders, encoders, comparators, counters, etc.
Slide 6: Digital Systems
- The control unit has a sequencer that determines the sequence of microoperations
  - The sequencer needs status signals from the data unit to know what is happening there
  - Then, based also on the current state, it determines which microoperations are to be performed and indicates them to the datapath by means of control signals
Slide 7: Designing Digital Systems
- Datapath design is simpler than control unit design since the datapath has highly regular (duplicated) circuits
  - A 64-bit adder is composed of four identical 16-bit adders
  - A 64-bit comparator consists of eight identical 8-bit comparators, etc.
- Control unit design is more difficult due to
  - Large amounts of random logic
  - A substantial amount of effort is needed to make sure there are no timing problems
    - Microoperations must start at the right time and end at the right time!
Slide 8: Designing digital systems
- We will use the finite-state machine (FSM) technique to design the EMY CPU, where the FSM state diagram will have states with microoperations
  - The state diagram shows precisely which state follows which state
  - Each state indicates which microoperations to perform
  - The state diagram shows which states are needed when, for which machine language instruction
Slide 9: Designing digital systems
- We will design the EMY CPU by using the finite-state machine (FSM) technique
- More specifically, we will obtain the following for the complete EMY CPU design
  - A high-level state diagram to show which microoperation happens when
  - The datapath from the high-level state diagram
  - The low-level state diagram from the high-level state diagram and the datapath
  - The control unit from the low-level state diagram
    - It can be implemented by hardwiring and/or microprogramming
Slide 10: Designing the microarchitecture level of a computer
- There are two tasks in this design
  - Develop the CPU and memory digital systems so that instructions can be run
  - Develop the memory and I/O controller digital systems so that I/O can happen
- We will concentrate on the CPU and memory digital systems
Slide 11: Designing the CPU and memory digital systems
- First we focus on the CPU digital system while we quickly make a few design decisions on the memory
  - We have designed the CPU as a slow CPU running only integer instructions: no pipelining
    - This is Version 0
    - We assumed the memory was fast, which is not realistic today
    - We will see how a memory hierarchy with cache memories, etc. can be incorporated
    - This CPU coverage is given in another PowerPoint presentation
  - Now, we improve the CPU speed by using pipelining, but still running integer instructions
    - This is Version 1
    - We will assume the memory is fast, which is again not realistic today
    - Then, we will see how a memory hierarchy with cache memories, etc. can be incorporated
- For both versions the memory will be a black box with a few details
Slide 12: Designing the CPU as a Digital System
- The unpipelined EMY CPU digital system has been designed for nine integer instructions
  - We obtained its
    - High-level state diagram
    - Datapath
    - Low-level state diagram
    - Control unit
- We will design the pipelined EMY CPU digital system for eight integer instructions
  - We will obtain its
    - High-level state diagram
    - Datapath
Slide 13: Designing the Unpipelined CPU Digital System
- To design the unpipelined EMY CPU, we started with the EMY architecture
  - What is the connection between the architecture and the CPU?
    - A computer processes digital information by running machine language instructions
    - A machine language program is a list of instructions, each of which specifies operations on data (arguments)
    - An instruction specifies architectural operations
    - Each architectural operation is implemented by microoperations
Slide 14: Designing the Unpipelined CPU Digital System
- In order to perform an architectural operation, the CPU performs a series of microoperations in a number of clock periods
  - That is, an architectural operation is broken down into smaller operations called microoperations
  - That is, to run a machine language instruction, the CPU performs microoperations
    - The CPU performs some microoperations by itself and some in cooperation with the memory and the I/O controllers
Slide 15: Designing the Unpipelined CPU Digital System
- Architectural operations
  - An architectural operation is what we describe as the semantics of the instruction, such as
    - The architectural operation specified by the ADD instruction: Rd <-- Rs + Rt
    - The architectural operation specified by the SUB instruction: Rd <-- Rs - Rt
    - The architectural operation specified by the SLT instruction: If Rs < Rt then Rd <-- 1 else Rd <-- 0
    - The architectural operation specified by the J instruction: PC[27-0] <-- (Address * 4)
  - It is the CPU that contributes the most to the execution of an instruction since it performs most of the microoperations needed for an architectural operation
Slide 16: Designing the Unpipelined CPU Digital System
- Typical CPU digital system microoperations
  - Add, subtract, multiply
    - In the past, a 32-bit addition was completed in 1 clock period. Today, a 32-bit addition is completed in several clock periods
  - AND, OR, XOR
  - Shift right, shift left
  - Read data from memory, write data to memory
    - In the past, a memory access was completed in 1 clock period. Today, it is completed in several clock periods
  - Read instructions from memory (fetch)
  - Increment the program counter
  - Transfer a register to another register
  - …
Slide 17: Designing the Unpipelined CPU as a Digital System
- Other machines, especially CISC machines, require other microoperations such as
  - Reading indirect address(es) from the memory
  - Effective address calculation for
    - Indexing
    - Autoincrement
    - Autodecrement
  - Alignment for
    - Instructions
    - Data
    - Addresses
Slide 18: Designing the Unpipelined CPU Digital System
- Architecture’s effect on microoperations
  - The decisions made on architecture determine the microoperations needed for the execution of the instructions
    - General microoperations found on most CPUs
      - The ones mentioned on previous slides
    - Specific microoperations for certain CPUs
    - Specific microoperations for Memory Management Units (MMUs), caches, I/O controllers
  - The architecture also determines the characteristics of each microoperation
    - If the 26-bit PC-direct addressing mode is used, the rightmost 26 bits of IR are concatenated with the leftmost 4 bits of PC and the resulting 30 bits are shifted to the left by 2
  - Thus, each machine language instruction requires a certain number of microoperations taking a certain time: the CPIi
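The 26-bit PC-direct address formation described on the slide above can be sketched in Python. This is only an illustration: it assumes a 32-bit PC and a 32-bit IR (implied by 4 + 26 bits shifted left by 2), and the function name and the opcode bits in the usage line are hypothetical.

```python
def jump_target(pc: int, ir: int) -> int:
    """Target address of a jump under 26-bit PC-direct addressing.

    Assumes a 32-bit PC and a 32-bit IR (an assumption, not stated on the slide).
    """
    address = ir & 0x03FF_FFFF      # rightmost 26 bits of IR
    pc_high = (pc >> 28) & 0xF      # leftmost 4 bits of PC
    thirty_bits = (pc_high << 26) | address
    return thirty_bits << 2         # shift the 30 bits left by 2 (Address * 4)


# Hypothetical encoding: the upper 6 bits of IR are arbitrary opcode bits here.
ir = (0b000010 << 26) | 0x100080
print(hex(jump_target(0x0040_0200, ir)))
```

The leftmost 4 bits of the result come from the PC, so the lower 28 bits are exactly Address * 4, in line with PC[27-0] <-- (Address * 4).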
Slide 19: Designing the Unpipelined CPU Digital System
- Microoperations
  - The CPU can perform one or more microoperations per clock period, depending on the complexity of the microoperation and the availability of the hardware resources
    - Most often a microoperation can be completed in one clock period unless it is a complex microoperation
    - If a complex microoperation is to be run in one clock period, the clock period needs to be longer
  - The more numerous and complex the microoperations are, the longer it takes to run the machine language instruction
    - CISC instructions take a longer time to execute (larger CPIi)
Slide 20: Designing the Unpipelined CPU Digital System
- Calculating CPIi
  - The time it takes to run an instruction, CPIi, is then determined by
    - The number of microoperations needed for it
    - The complexity of the microoperations
  - The number of clock periods for an instruction, CPIi, becomes a matter of figuring out the microoperations and how to distribute them to individual clock periods
    - One can come up with 5-10 simple microoperations to be performed one after another, resulting in a CPIi of 5-10
      - But, since the microoperations are simple, the clock period is short
    - Alternatively, one can come up with 2-4 complex microoperations, resulting in a CPIi of 2-4
      - But, the clock period is longer
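The trade-off above comes down to the product of CPIi and the clock period. A small sketch, where the clock periods (1000 ps and 2500 ps) and the function name are purely hypothetical values chosen for illustration:

```python
def instruction_time_ps(cpi_i: int, clock_period_ps: int) -> int:
    """Time to run one instruction = CPIi x clock period (in picoseconds)."""
    return cpi_i * clock_period_ps


# Hypothetical numbers: 8 simple microoperations with a short clock period
# versus 3 complex microoperations with a longer clock period.
many_simple = instruction_time_ps(cpi_i=8, clock_period_ps=1000)
few_complex = instruction_time_ps(cpi_i=3, clock_period_ps=2500)
print(many_simple, few_complex)
```

With these made-up values the two options land close to each other, which is why neither choice is automatically better when only a single unpipelined instruction is considered.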
Slide 21: Designing the Unpipelined CPU Digital System
- Calculating CPIi
  - What can we do? Few long clock periods vs. many but shorter clock periods?
    - Since increasing the clock frequency is important for marketing purposes, the second option would weigh in substantially
    - It turns out that if pipelining is implemented, having many shorter clock periods would be beneficial, as we will see
      - CPIi figures will be large but CPIave will be close to 1 (one)!
  - Today’s microprocessors have instruction CPIi values in the range of 10-30, but CPIave figures for their targeted applications are even less than 1 (one)!
    - Because they employ advanced pipelining techniques, such as superscalar execution, hyperthreading, etc.
Slide 22: Designing the Unpipelined CPU Digital System
- Determining microoperations for a machine language instruction
  - Some microoperations are performed for all the instructions
    - Usually at the same point in time during the execution of every instruction
    - Fetching the instruction is always the first microoperation to perform for all CPUs
    - Updating PC (PC <-- PC + 4) so that it points at the next instruction is also universal
  - The other microoperations depend on the instruction, the addressing mode, where the arguments are, the length of the arguments, etc.
Slide 23: Designing the Unpipelined CPU Digital System
- Determining microoperations for a machine language instruction
  - We would list all the microoperations for each instruction, making sure that we are consistent in terms of
    - Bus usage
      - We often decide on an approximate number of buses we need for our datapath
      - Today’s CPUs have at least three internal buses to complete an integer arithmetic microoperation in one clock period
        - Two buses carry the numbers from two registers and the third bus carries the result to a register
    - ALU usage
      - An ALU is expensive and so we try to limit the number of ALUs
Slide 24: Designing the Unpipelined CPU Digital System
- Determining microoperations for a machine language instruction
  - We would list all the microoperations for each instruction, making sure that we are consistent in terms of
    - Register usage
      - Additional registers not visible to the architecture level are used to keep temporary values: microarchitectural registers
      - Typically, the more registers are used, the more clock periods we spend for an instruction, since temporary values will be passed from one register in one clock period to another register to be used in the following clock period
      - But, sometimes we have to use microarchitectural registers, such as the instruction register that keeps the current instruction
    - Control unit usage
Slide 25: Designing the Unpipelined CPU Digital System
- Determine how each EMY architectural operation is implemented by microoperations
  - Most microoperations must be simple enough to be completed within one clock period
    - A few microoperations may not be completed in a clock period
      - For example, a memory read may take several clock periods since the memory is slower
    - These long microoperations should be accommodated in the high-level state diagram, the datapath, the low-level state diagram and the control unit
  - We will assume in the beginning that every microoperation is completed in one clock period
Slide 26: Designing the Unpipelined CPU Digital System
- The EMY microoperations implied by the EMY machine language instructions include
  - Instruction fetch, performed always
  - Update PC for next instruction, performed always
  - Effective address calculation for displacement and relative addressing modes
  - Sign extension or concatenation of 0s for data/addresses
  - Reading data from the memory
  - Writing data to the memory
  - Performing an arithmetic/logic operation
  - Register transfer
  - Testing a condition
Slide 27: What is Pipelining?
- The unpipelined MIPS CPU can be thought of as having five stages that correspond to the five major cycles
  - For the unpipelined MIPS CPU, at any time only one stage is busy and the remaining ones are idle
[Figure: instructions enter a datapath made of the five stages IF, ID, EX, MEM and WB, which are driven by the control unit, and leave at the other end]
Slide 28: What is Pipelining?
- The unpipelined CPU works like this:
  - Only one instruction is in the pipeline!
[Figure: LW R8, 0(R9) moves through IF, ID, EX, MEM and WB in clock periods 1 to 5, one stage per clock period; ADD R10, R8, R11 enters IF in clock period 6. Continues this way…]
Slide 29: What is Pipelining?
- Pipelining is the simultaneous execution of multiple instructions in an assembly line fashion in a single CPU
[Figure: over clock periods 1 to 5 the stages IF, ID, EX, MEM and WB fill with successive instructions (LW R8, 0(R9); ADD R10, R8, R11; ADD R12, R13, R14; SW R12, 0(R15); BEQ R12, R0, 3) until every stage holds a different instruction]
Slide 30: What is Pipelining?
- Pipelining is a microarchitectural technique where the execution of consecutive instructions is overlapped
  - Each instruction is in a pipeline stage
  - All stages are busy
Slide 31: What is a Stage?
- Each stage is specialized hardware corresponding to a specific major cycle
  - IF, ID, EX, MEM, WB
- The hardware for each major cycle can then be easily identified and is often called a stage
Slide 32: What is Pipelining?
- Pipelined execution of instructions is similar to the assembly line manufacturing of cars
Slide 33: What is Pipelining?
- There are two differences
  - On a car assembly line there is only one type of car assembled
    - For the CPU, the instructions executed are different
      - Loads, Stores, A/L, Branch instructions
  - All the cars on an assembly line have the same requirements: the same pieces are placed on the cars
    - For the CPU, even if two back-to-back instructions are of the same type (for example, two back-to-back Loads), they have different requirements (different effective addresses, hence different memory locations are accessed)
Slide 34: What is Pipelining?
- Because of these two differences, each stage has to pass information related to the instruction it just worked on to the next stage
  - Temporary registers (latches, buffers) are used between two stages to pass the information about the instruction just leaving one stage and entering the next one
[Figure: the stages IF, ID, EX, MEM and WB with latches placed between consecutive stages]
Slide 35: What is Pipelining?
- Latches are then necessary to pass information about an instruction from one stage to the next
- Latches are also needed so that partial work done by one stage is passed to the next stage, so the work continues
Slide 36: What is the Pipe?
- We give the name “pipe” to the set of stages since the stages are cascaded in a single dimension, forming a pipe where instructions
  - Enter from one end
  - Stay in a stage for one clock period
  - Proceed to the next stage
  - Finally exit from the other end
    - By which time the instruction execution is completed
Slide 37: What is Pipelining?
- Consider a sequence of instructions and a 5-stage pipeline
  - Assume that all the instructions use the five stages
    - That is, they all take five clock periods to complete their execution
    - This is not possible in real life, but let’s assume this for the time being to understand pipelining quickly
[Figure: instructions I1, I2, …, I9, … wait in order to enter the pipeline IF, ID, EX, MEM, WB, with I1 entering first]
Slide 38: What is Pipelining?
- The execution can be shown as follows

  Stage \ Time   1    2    3    4    5    6    7    8
  IF             I1   I2   I3   I4   I5   I6   I7   I8
  ID                  I1   I2   I3   I4   I5   I6   I7
  EX                       I1   I2   I3   I4   I5   I6
  MEM                           I1   I2   I3   I4   I5
  WB                                 I1   I2   I3   I4

- Pipeline is full ≡ all stages are busy ≡ start-up time = 5 clock periods
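The stage-by-time chart above follows one rule: instruction number k enters IF in clock period k and moves one stage to the right every clock period. A minimal sketch that regenerates the chart under the slide's assumption that every instruction uses all five stages (the function and variable names are illustrative):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def occupancy(num_instructions: int, num_clocks: int) -> dict:
    """Map each stage to the instruction it holds in every clock period."""
    table = {stage: [""] * num_clocks for stage in STAGES}
    for i in range(num_instructions):
        for s, stage in enumerate(STAGES):
            clock = i + s  # 0-based clock period in which I(i+1) is in this stage
            if clock < num_clocks:
                table[stage][clock] = f"I{i + 1}"
    return table


table = occupancy(num_instructions=8, num_clocks=8)
for stage in STAGES:
    print(f"{stage:>3}: " + " ".join(f"{cell:>2}" for cell in table[stage]))
```

From clock period 5 onward no cell in a column is empty, which is the "pipeline is full" condition.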
Slide 39: What is Pipelining?
- Compared with the unpipelined CPU, the five stages are more complex to allow overlapped execution
- All stages take the same amount of time, one clock period
  - The length of the clock period is determined by the slowest stage
    - Because it is difficult to obtain stages with equal amounts of work, hence time
Slide 40: What is Pipelining?
- If the CPU is unpipelined, the instructions would take 5 clock periods each
  - CPIi = 5
    - Since each instruction is taking 5 clock periods
  - CPIave = 5
    - Since the number of clock periods divided by the number of instructions run is 5
    - CPIave = 35 clock periods / 7 instructions = 5
[Figure: timeline on which I1 to I7 complete at times 5, 10, 15, 20, 25, 30 and 35]
Slide 41: What is Pipelining?
- If the CPU is pipelined, after the pipeline becomes full (the start-up time), an instruction is completed every clock period as opposed to every 5 clock periods
  - CPIi = 5
    - Since each instruction is taking 5 clock periods
  - CPIave ≈ 1
    - Since after the start-up time, we complete one instruction each clock period
[Figure: timeline on which I1 to I7 complete at times 5, 6, 7, 8, 9, 10 and 11]
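The two timelines (35 versus 11 clock periods for seven instructions) can be reproduced with a short calculation. The sketch assumes, as the slides do here, that every instruction uses all five stages and that nothing ever stalls; the function names are illustrative.

```python
def unpipelined_clocks(num_instructions: int, stages: int = 5) -> int:
    """Each instruction runs alone and takes `stages` clock periods."""
    return num_instructions * stages


def pipelined_clocks(num_instructions: int, stages: int = 5) -> int:
    """Start-up time of `stages` clock periods, then one completion per clock."""
    return stages + (num_instructions - 1)


for n in (7, 1000):
    total = pipelined_clocks(n)
    print(n, unpipelined_clocks(n), total, round(total / n, 3))
```

For seven instructions the pipelined CPIave is 11/7, still noticeably above 1 because of the start-up time; for long instruction streams it approaches 1.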
Slide 42: What is Pipelining?
- Once the pipeline is filled, an instruction exits the pipeline each clock period
  - Each clock period an instruction is completed
    - It seems each instruction takes one clock period to execute
    - CPIave ≈ 1 !!!
Slide 43: What is Pipelining?
- Assume for the next few slides that the unpipelined EMY CPU is converted to a pipelined CPU
  - CPILW = 5
  - CPISW = 4
  - CPIA/L R Format = 4
  - CPIBEQ = 3
Slide 44: What is Pipelining?
- Consider the following piece of EMY code
  ---
  400200  LW   R8, 0(R9)        ; R8 <-- M[R9 + 0+]
  400204  ADD  R10, R11, R12    ; R10 <-- R11 + R12
  400208  SUB  R13, R14, R15    ; R13 <-- R14 - R15
  40020C  XOR  R16, R17, R18    ; R16 <-- R17 XOR R18
  400210  SW   R19, 0(R20)      ; M[R20 + 0+] <-- R19
  400214  OR   R21, R22, R23    ; R21 <-- R22 | R23
  400218  SLT  R24, R25, R26    ; If R25 < R26, R24 <-- 1, else R24 <-- 0
  40021C  BEQ  R27, R28, 5      ; If R27 is equal to R28, branch to 400234
  ---
- This code is not realistic since the instructions are all independent of each other!
- But, for the sake of understanding pipelining, we will use this piece of code!
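The branch target 400234 in the BEQ comment is consistent with a MIPS-style rule: the offset counts instructions (4 bytes each) relative to the instruction after the branch. That rule is inferred from the listing rather than stated on the slide, so the sketch below is an assumption-laden check, not a definition of EMY branching.

```python
def branch_target(branch_pc: int, offset_words: int) -> int:
    """Assumed rule: target = (address of the branch + 4) + 4 * offset."""
    return (branch_pc + 4) + 4 * offset_words


# BEQ R27, R28, 5 located at 40021C
print(hex(branch_target(0x40021C, 5)))
```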
Slide 45: What is Pipelining?
- Let’s see its pipelined execution by using the textbook’s notation and assume that the memory takes one clock period

  Clock period               1    2    3    4    5    6    7    8    9    10
  400200 LW  R8, 0(R9)       IF   ID   EX   MEM  WB
  400204 ADD R10, R11, R12        IF   ID   EX   MEM
  400208 SUB R13, R14, R15             IF   ID   EX   MEM
  40020C XOR R16, R17, R18                  IF   ID   EX   MEM
  400210 SW  R19, 0(R20)                         IF   ID   EX   MEM
  400214 OR  R21, R22, R23                            IF   ID   EX   MEM
  400218 SLT R24, R25, R26                                 IF   ID   EX   MEM
  40021C BEQ R27, R28, 5                                        IF   ID   EX
Slide 46: What is Pipelining?
- The textbook’s notation is hard to follow if there are more than a few instructions
  - Also, the notation requires a lot of space even for a few instructions
- From now on, we will use our notation
  - The execution, assuming that the cache memories take one clock period and there is no miss, is as follows (each entry is the clock period in which the instruction uses a stage)

                             IF   ID   EX   MEM  WB
  400200 LW  R8, 0(R9)       1    2    3    4    5
  400204 ADD R10, R11, R12   2    3    4    5
  400208 SUB R13, R14, R15   3    4    5    6
  40020C XOR R16, R17, R18   4    5    6    7
  400210 SW  R19, 0(R20)     5    6    7    8
  400214 OR  R21, R22, R23   6    7    8    9
  400218 SLT R24, R25, R26   7    8    9    10
  40021C BEQ R27, R28, 5     8    9    10
Slide 47: What is Pipelining?
- What if the EMY CPU was not pipelined?
  - The execution timing would be as follows, assuming that the cache memories take one clock period and there is no miss

                             IF   ID   EX   MEM  WB
  400200 LW  R8, 0(R9)       1    2    3    4    5
  400204 ADD R10, R11, R12   6    7    8    9
  400208 SUB R13, R14, R15   10   11   12   13
  40020C XOR R16, R17, R18   14   15   16   17
  400210 SW  R19, 0(R20)     18   19   20   21
  400214 OR  R21, R22, R23   22   23   24   25
  400218 SLT R24, R25, R26   26   27   28   29
  40021C BEQ R27, R28, 5     30   31   32

- The execution completes in 32 clock periods!
- Pipelined execution takes 10 clock periods!
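Both totals follow from the per-instruction CPIi values assumed a few slides earlier (LW 5, SW 4, A/L 4, BEQ 3). A minimal sketch, assuming one instruction enters the pipeline per clock period and nothing stalls; the dictionary and function names are illustrative.

```python
# CPIi values assumed on the earlier slide for this example.
CPI = {"LW": 5, "SW": 4, "A/L": 4, "BEQ": 3}

# LW, ADD, SUB, XOR, SW, OR, SLT, BEQ in program order.
PROGRAM = ["LW", "A/L", "A/L", "A/L", "SW", "A/L", "A/L", "BEQ"]


def unpipelined_total(program) -> int:
    """Instructions run one after another, so the CPIi values add up."""
    return sum(CPI[kind] for kind in program)


def pipelined_total(program) -> int:
    """Instruction k (0-based) starts in clock k + 1 and ends in clock k + CPIi."""
    return max(k + CPI[kind] for k, kind in enumerate(program))


print(unpipelined_total(PROGRAM), pipelined_total(PROGRAM))
```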
Slide 48: What is Pipelining?
- Pipelining decreases the execution time of the program, CPUtime
  - The number of instructions run, NI, stays the same
    - We execute the same number of instructions for a program
  - The CPIi stays the same
    - Often the unpipelined CPIi and pipelined CPIi differ slightly for efficient pipelining
      - The Branch CPIi will reduce from 4 to 3
      - The A/L Format CPIi will go up from 4 to 5
    - Instructions go through similar stages as in the unpipelined case
  - But, we execute several instructions at the same time
    - All the stages are busy now
    - The CPU does more per clock period
    - CPIave decreases
Slide 49: What is Pipelining?
- We execute more instructions per unit time (a second)
  - The throughput is increased
    - The MIPSave figure is increased
      - The number of instructions executed per second is increased
    - The MFLOPSave figure is increased
      - The number of FP operations performed per second is increased
  - That is why companies like to mention the MIPSave and MFLOPSave figures for their new generations of microprocessors, since each new generation improves the pipeline, which directly improves MIPSave and MFLOPSave
Slide 50: What is Pipelining?
- Pipelining does not decrease the CPIi of each individual instruction, but it increases the clock period slightly
  - The execution time of each instruction in terms of seconds is increased slightly!
    - This is due to the slightly longer clock period
      - This is due to the overhead of handling several instructions per clock period
Slide 51: Hardware-related issues to solve
- The stages must be precisely timed, synchronized
  - Each stage must take the same amount of time
    - Each stage must have about the same amount of work
    - This is hard to achieve unless it is a RISC architecture
- Suppose that we managed to have the same amount of work per stage so that each stage takes the same time
  - What is the clock period?
    - Theoretically the clock period can stay the same as for the unpipelined CPU
    - But the simultaneous execution increases the overhead per clock period
    - The clock period duration is increased slightly!
Slide 52: Hardware-related issues to solve
- A solution to these two problems today is to break up stages that are taking too long into several simpler stages so that the stages are finer
  - Then, the pipeline is longer ≡ there are many stages
  - Since each stage is doing simpler work, the clock period is shorter ≡ the clock frequency is higher
  - Today, a technique to increase the microprocessor frequency is exactly this ≡ make stages simpler and simpler ≡ make pipelines longer and longer
    - Today’s microprocessor pipelines are typically 15 to 25 stages long
  - Clock skew problems can cause timing problems
    - A signal may arrive too late to play a role in generating another signal since the pipeline is very long!
Slide 53: Pipelined EMY CPU Design
- In CS2214, we design the EMY CPU by going through two versions: 0 and 1
  - Version 0 is the unpipelined CPU executing only integer instructions
  - Version 1 is the pipelined CPU executing only integer instructions
- Initially, the Version 1 design will not be an acceptable design
  - New hardware to handle pipelining is not identified
    - For example, the latches between stages are not identified
    - The CPU must have latches, so we will quickly change the design
  - It will not handle well certain situations called hazards
    - There are three types of hazards: structural, data and control
    - All programs have hazards, so we will quickly change the design
Slide 54: Pipelined EMY CPU Design
- In CS2214, we design the EMY CPU by going through two versions: 0 and 1
  - Version 0 is the unpipelined CPU executing only integer instructions
  - Version 1 is the pipelined CPU executing only integer instructions
- Initially, the Version 1 design will not be an acceptable design
  - Branch instructions take too long, causing pipeline startups
    - Control instructions must take a shorter time, so we will quickly change the design
  - It will assume ideal memory
    - All memory accesses take one clock period
    - We will partially deal with the slower memory and leave the rest to the Computer Architecture II course
  - It will have imprecise interrupts
    - We will leave it to the Computer Architecture II course
  - We will not discuss the control unit, but we will know that it is there
- So, somehow, the initial design of this version of the MIPS CPU executes the code in a pipelined fashion
Slide 55: Pipelined EMY CPU Design Versions
- We will design the pipelined MIPS CPU Version 1 in several steps
  - As mentioned above, initially, the Version 1 design will not be an acceptable design
  - The final design of Version 1 will improve the pipeline by introducing additional hardware to better handle integer instructions
    - New hardware, including latches, to handle pipelining will be identified
    - It will better handle the three hazards
    - Branch instructions will take 2 clock periods
      - But, we will have delayed branches, which is not practical
  - It will still have some unacceptable features
    - It will assume slower Level 1 cache memories and misses on Level 1 cache memories, but not misses on lower level cache memories
    - It will still have imprecise interrupts
Slide 56: Pipelining EMY CPU
- Consider the mnemonic machine language code discussed before
  ---
  400200  LW   R8, 0(R9)        ; R8 <-- M[R9 + 0+]
  400204  ADD  R10, R11, R12    ; R10 <-- R11 + R12
  400208  SUB  R13, R14, R15    ; R13 <-- R14 - R15
  40020C  XOR  R16, R17, R18    ; R16 <-- R17 XOR R18
  400210  SW   R19, 0(R20)      ; M[R20 + 0+] <-- R19
  400214  OR   R21, R22, R23    ; R21 <-- R22 | R23
  400218  SLT  R24, R25, R26    ; If R25 < R26, R24 <-- 1, else R24 <-- 0
  40021C  BEQ  R27, R28, 5      ; If R27 is equal to R28, branch to 400234
  ---
Slide 57: Pipelining EMY CPU
- Here is the execution of the code discussed earlier
  - This EMY CPU pipeline version has problems, as mentioned on slides 53 and 54
  - This EMY CPU pipeline also makes assumptions that are not acceptable, as mentioned on the next slide

                             IF   ID   EX   MEM  WB
  400200 LW  R8, 0(R9)       1    2    3    4    5
  400204 ADD R10, R11, R12   2    3    4    5
  400208 SUB R13, R14, R15   3    4    5    6
  40020C XOR R16, R17, R18   4    5    6    7
  400210 SW  R19, 0(R20)     5    6    7    8
  400214 OR  R21, R22, R23   6    7    8    9
  400218 SLT R24, R25, R26   7    8    9    10
  40021C BEQ R27, R28, 5     8    9    10
Slide 58: Issues with the Current Design
- This program will be executed without difficulty since all instructions are independent of each other
  1) There is no real application where all instructions are independent of each other
     - Real-life applications have instruction dependencies
       - Instruction I1 generates a result that is used by another instruction, I2, so that I2 depends on I1
  2) This code assumes we will always execute in sequence, even if we execute branch instructions
     - That is, it assumes branches are never taken
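The dependency described above (I1 produces a result that I2 uses) can be checked mechanically. A minimal sketch: each instruction is written as a hypothetical (mnemonic, destination, sources) tuple, and the check deliberately ignores the case where a register is overwritten between the producer and the consumer.

```python
# (mnemonic, destination register or None, source registers)
INDEPENDENT = [
    ("LW", "R8", ("R9",)),
    ("ADD", "R10", ("R11", "R12")),
    ("SUB", "R13", ("R14", "R15")),
    ("XOR", "R16", ("R17", "R18")),
    ("SW", None, ("R19", "R20")),
    ("OR", "R21", ("R22", "R23")),
    ("SLT", "R24", ("R25", "R26")),
    ("BEQ", None, ("R27", "R28")),
]

# The pair used on the unpipelined-execution slide: ADD reads the R8 loaded by LW.
DEPENDENT = [
    ("LW", "R8", ("R9",)),
    ("ADD", "R10", ("R8", "R11")),
]


def dependencies(program):
    """Return (producer index, consumer index, register) for each dependency."""
    found = []
    for j, (_, _, sources) in enumerate(program):
        for i in range(j):
            dest = program[i][1]
            if dest is not None and dest in sources:
                found.append((i, j, dest))
    return found


print(dependencies(INDEPENDENT))
print(dependencies(DEPENDENT))
```

The eight-instruction example yields no dependencies, which is exactly why it runs through the initial pipeline without difficulty.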
Slide 59: Improving Initial Version 1 Design
- The pipelined EMY CPU state diagram and pipeline stages
  - We will obtain the final state diagram and final datapath after several iterations
  - The initial design of Version 1 will be improved by going through several designs
    - First, we will add new hardware, including latches
    - Second, we will handle hazards better
    - Third, we will execute Branch instructions faster
    - Fourth, we will assume slower memory and so Level 1 cache memories will be used
      - Level 1 cache memories will take more than one clock period
      - Level 1 cache memories will have cache misses
Slide 60: Improving Initial Version 1 Design
- Version 1 will be improved by going through several designs
  - First we will add the hardware overhead, including latches
    - When we have pipelined execution, it is important not to lose the information about the execution of each instruction
      - With pipelining, each stage does some work for the instruction and by doing so affects the architectural registers and the memory (the state)
      - Some piece of this state is needed to execute an instruction in later stages
    - So, when we move an instruction from one stage to another, it is necessary to transfer the information related to the instruction to the next stage (to make the state of the instruction available to the next stage) so that correct execution happens
EMY CPU Version 1 61CS 2214
Latching hardware Each stage starts with the “sum” of work that has
been done on its instruction in previous stages Each stage works on the instruction resulting in new
work that will be needed in later stages to complete the instruction execution
For that purpose stages are provided with latches In other words, a stage works on an instruction that has left
the previous stage and produces something related to the instruction and passes it to the next stage to be used in the next clock period
Thus, we need to save the work of a stage in temporary registers (latches) for the next stage
Latching hardware
So we need the latches (buffers)
The amount of storage (the number of latches) between two stages is not constant :
Stage          IF  ID  EX  MEM  WB
Instructions   I7  I6  I5  I4   I3
               I8  I7  I6  I5   I4
Latching hardware
The new hardware
  Four IRs
    Though not all the bits of the extra IRs are needed in every stage
  Two NPC registers
  Two ALUoutput registers
  One A register
  Two B registers
  One Imm register
  One TA register
  One Zero flip-flop
  One MDR register
Latching hardware Here is the new look of the MIPS CPU datapath with latches
The leftmost latch set (with NPC and IR) will be called latch set 2 since these latches are used by the second stage from left (ID)
The next latch set to the right (NPC, A, B, Imm and IR) is latch set 3, and so on
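[Figure: Version 1 datapath with latches. PC feeds the IF stage; latch set 2 (NPC, IR) feeds ID, which reads the GPR; latch set 3 (NPC, A, B, Imm, IR) feeds EX; latch set 4 (TA, Zero, ALUout, B, IR) feeds MEM; latch set 5 (ALUout, MDR, IR) feeds WB]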
Latching hardware
We will identify the registers by using the latch set number (or the stage number using the registers)
  Latch set 2 registers (Stage 2 uses them)
    2.NPC and 2.IR
    Used by the second stage from left : ID
  Latch set 3 registers (Stage 3 uses them)
    3.NPC, 3.A, 3.B, 3.Imm and 3.IR
    Used by the third stage from left : EX
  Latch set 4 registers (Stage 4 uses them)
    4.Zero, 4.ALUout, 4.B and 4.IR
    Used by the fourth stage from left : MEM
  Latch set 5 registers (Stage 5 uses them)
    5.ALUout, 5.MDR and 5.IR
Latching hardware What did we do ?
We identified latches for the pipelined execution of instructions
The initial implementation of Version 1 does not identify the latches
The initial implementation of Version 1 does not specify that there are four IR registers, two NPC registers, two ALUout registers, etc.
Timing of Microoperations
We need to know about the timing of microoperations
  When exactly does the instruction fetch occur for the LW instruction ?
    That is, we know the instruction fetch will happen in clock period 1 (one), but exactly when ?
  Similarly, when exactly does PC get its value updated to 400204 when we execute the LW ?
Note : On the unpipelined CPU, this code takes 32 clock periods !
----
400200 LW R8, 0(R9)
400204 ADD R10, R11, R12
400208 SUB R13, R14, R15
40020C XOR R16, R17, R18
400210 SW R19, 0(R20)
400214 OR R21, R22, R23
400218 SLT R24, R25, R26
40021C BEQ R27, R28, 5
----
Timing of Microoperations
We clock (store on) our registers at the end of a clock period and therefore, registers change their values at the beginning of the next clock period
  Therefore, IR gets its new value (the LW instruction) at the beginning of the ID cycle (in clock period 2)
  PC gets its new value (400204) at the beginning of the ID cycle (in clock period 2)
Clock            Clock period 1   Clock period 2
PC     4001FC    400200           400204           400208
IR     ?         ?                LW R8, 0(R9)     ADD R10, R11, R12
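The end-of-period clocking described above can be sketched in Python. This is only an illustration under stated assumptions, not the EMY hardware: the `memory` dictionary and the `clock_tick` helper are hypothetical names, and only PC and IR are modelled.

```python
# Hypothetical instruction memory holding the first two instructions of the example.
memory = {
    0x400200: "LW R8, 0(R9)",
    0x400204: "ADD R10, R11, R12",
}

def clock_tick(state):
    """Compute the next register values from the current ones, then commit them together.

    Both right-hand sides use the values held during the current clock
    period; the new values become visible only in the next clock period.
    """
    return {
        "IR": memory[state["PC"]],  # IR <- M[PC]
        "PC": state["PC"] + 4,      # PC <- PC + 4
    }

period1 = {"PC": 0x400200, "IR": None}  # values held during clock period 1
period2 = clock_tick(period1)           # values held during clock period 2
period3 = clock_tick(period2)           # values held during clock period 3

print(hex(period2["PC"]), period2["IR"])
print(hex(period3["PC"]), period3["IR"])
```

Because both new values are computed from the old PC before anything is stored, the fetch in clock period 1 uses 400200 even though PC is updated to 400204 at the end of that same period.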
Instruction fetch (IF) Cycle
Fetch the instruction pointed to by PC into 2.IR
  2.IR ← M[PC]
Update PC by adding 4
  PC ← PC + 4
How about 2.NPC ?
  Soon, we will see that !
This cycle will be more complex when we cover BEQ later
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Instruction decode/register fetch (ID) Cycle
Prepare temporary registers A, B and Imm in case we need the GPR registers, an effective address or an immediate operand
  3.A ← GPR[2.IR.Rs]
  3.B ← GPR[2.IR.Rt]
  3.Imm ← 2.IR.DOImm+
How about 3.NPC & 3.IR ?
  Soon, we will see them !
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Execute (EX) Cycle for LW/SW Instructions
How do we know we have a LW/SW instruction ?
  The IR register for this stage (3.IR) was not given the value of the IR register of the previous stage (2.IR)
  We need to update the ID stage : 3.IR ← 2.IR
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Instruction decode/register fetch (ID) cycle
Prepare temporary registers A, B and Imm and move IR to the next stage
  3.A ← GPR[2.IR.Rs]
  3.B ← GPR[2.IR.Rt]
  3.Imm ← 2.IR.DOImm+
  3.IR ← 2.IR
How about 3.NPC ?
  Soon, we will see that !
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Execute (EX) Cycle for LW/SW Instructions
Calculate the effective address
  4.ALUout ← 3.A + 3.Imm
We should not forget to move 3.IR to the next stage
  4.IR ← 3.IR
How about 4.TA, 4.Zero and 4.B ?
  Soon, we will see them !
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Memory access/branch completion (MEM) Cycle for LW Instructions
Read the data from memory
  5.MDR ← M[4.ALUout]
We should not forget to move 4.IR to the next stage
  5.IR ← 4.IR
How about 5.ALUout ?
  Soon, we will see that !
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Write-back (WB) Cycle for LW instructions
Transfer MDR to a GPR register
  GPR[5.IR.Rt] ← 5.MDR
The LW takes 5 clock periods to execute : CPILW = 5
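The five LW cycles described so far can be followed end to end with a small Python sketch. It is a simplification under stated assumptions, not the actual datapath: the latch sets are plain dictionaries, the instruction is stored pre-decoded as a dictionary, the DOImm field is taken as already sign-extended, and `run_lw` is a hypothetical helper that walks a single instruction through the stages one after another.

```python
def run_lw(gpr, mem, pc):
    """Walk one LW through IF, ID, EX, MEM and WB using latch sets 2 to 5."""
    # IF: 2.IR <- M[PC]; PC <- PC + 4
    latch2 = {"IR": mem[pc]}
    pc = pc + 4
    # ID: 3.A <- GPR[2.IR.Rs]; 3.B <- GPR[2.IR.Rt]; 3.Imm <- 2.IR.DOImm; 3.IR <- 2.IR
    ir = latch2["IR"]
    latch3 = {"A": gpr[ir["Rs"]], "B": gpr[ir["Rt"]], "Imm": ir["Imm"], "IR": ir}
    # EX: 4.ALUout <- 3.A + 3.Imm; 4.IR <- 3.IR
    latch4 = {"ALUout": latch3["A"] + latch3["Imm"], "IR": latch3["IR"]}
    # MEM: 5.MDR <- M[4.ALUout]; 5.IR <- 4.IR
    latch5 = {"MDR": mem[latch4["ALUout"]], "IR": latch4["IR"]}
    # WB: GPR[5.IR.Rt] <- 5.MDR
    gpr[latch5["IR"]["Rt"]] = latch5["MDR"]
    return pc

# Hypothetical contents: R9 points at address 0x1000, which holds 42.
gpr = [0] * 32
gpr[9] = 0x1000
mem = {
    0x400200: {"op": "LW", "Rs": 9, "Rt": 8, "Imm": 0},  # LW R8, 0(R9)
    0x1000: 42,
}
new_pc = run_lw(gpr, mem, 0x400200)
print(gpr[8], hex(new_pc))
```

Each stage reads only the latch set written by the stage before it, which is why the IR has to be copied forward at every step.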
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Memory access/branch completion (MEM) Cycle for SW instructions
The effective address is in 4.ALUout
Where is the data to store ?
  It is in 3.B
  We did not transfer 3.B to 4.B !
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Execute (EX) Cycle for LW/SW Instructions
Calculate the effective address
  4.ALUout ← 3.A + 3.Imm
We should not forget to move 3.IR to the next stage
  4.IR ← 3.IR
Transfer 3.B to 4.B
  4.B ← 3.B
How about 4.TA and 4.Zero ?
  Soon, we will see that !
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Memory access/branch completion (MEM) Cycle for SW Instructions
Write 4.B to the memory location pointed to by 4.ALUout
  M[4.ALUout] ← 4.B
The SW takes 4 clock periods to execute : CPISW = 4
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Execute (EX) Cycle for A/L R-format instructions
Perform the operation specified by the Function field of 3.IR
  4.ALUout ← 3.A op 3.B
We have already moved 3.IR to the next stage
  4.IR ← 3.IR
How about 4.TA and 4.Zero ?
  Soon, we will see that !
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Memory access/branch completion (MEM) Cycle for A/L R-format Instructions We could complete the execution of these instructions in this
cycle by transferring 4.ALUout to a GPR register But, we decide to complete the execution in the WB cycle to help us
handle data hazards better as we will see later
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Memory access/branch completion (MEM) Cycle for A/L R-format Instructions
Transfer 4.ALUout and 4.IR to the next stage
  5.ALUout ← 4.ALUout
  5.IR ← 4.IR
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Write-back (WB) Cycle for A/L R-format instructions
We transfer the result from 5.ALUout to a GPR register
  GPR[5.IR.Rd] ← 5.ALUout
A/L R-format instructions take 5 clock periods to execute
  CPIA/L R-format = 5
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Execute (EX) Cycle for BEQ Instructions
We need to store the result of comparing 3.A with 3.B in 4.Zero
We need to calculate the effective address by adding PC and (4 times the Offset)
But, is PC changed by the instructions behind the BEQ ? Yes !
  We should have saved the PC value for the BEQ in a new register, NPC, in the IF cycle !
[Figure: Version 1 datapath with latch sets 2 to 5 between the IF, ID, EX, MEM and WB stages]
Execute (EX) Cycle for BEQ Instructions We need to study the execution of Branch instructions
more carefully
When the BEQ is in its EX stage, PC is 400608
400600 BEQ R8, R9, 4 ; Branch to 400614 if R8 = R9
400604 ADD R10, R11, R12
400608 SUB R13, R14, R15
40060C XOR R16, R17, R18
400610 SLT R19, R20, R21
400614 AND R22, R23, R24
[Figure: pipeline occupancy per clock period. Clock period, PC : 1, 400600 ; 2, 400604 ; 3, 400604. The BEQ is in IF in clock period 1, in ID in clock period 2 and in EX in clock period 3]
There is a problem !
We detect that there is a BEQ at the beginning of its ID cycle (clock period 2)
We then immediately stop the IF stage from fetching any instruction and stop adding 4 to PC
Execute (EX) Cycle for BEQ Instructions
We know we have a BEQ in the ID stage when we decode it
  PC is 400604 when the BEQ is in ID
When the Branch reaches EX, it expects to have PC = 400604
What shall we do ?
  We decide to have a new register to keep the PC value for the BEQ : NPC (New PC)
  We save the PC value for the BEQ in NPC in the IF stage
  So 400604 moves with the BEQ into the EX stage
When the ID stage detects a BEQ
  It stops the IF stage from fetching the next instruction
  We also have to stop incrementing PC so that if the condition is not satisfied, we execute the instruction following the BEQ
    This is the instruction in location 400604
    We should not execute the instruction at 400608 after we execute the BEQ
Execute (EX) Cycle for BEQ instructions
We change the IF and ID stages to include transfers to 2.NPC and 3.NPC
The EX stage for the BEQ is like this
  4.IR ← 3.IR
  4.Zero ← If 3.A = 3.B then 1
  4.TA ← 3.NPC + (3.Imm * 4)
Now, we have the correct PC value in 3.NPC in the EX stage
But, when do we write to PC so that we can branch ?
3.NPC has 400604
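As a quick check of the target address arithmetic, the EX-stage microoperations for the BEQ can be written as a small Python function. This is a sketch only: `beq_ex` is a hypothetical name, the register values 7 and 7 are made up, and setting 4.Zero to 0 when the operands differ is an assumption, since the microoperation above only states the "then 1" case.

```python
def beq_ex(a3, b3, imm3, npc3):
    """EX stage of a BEQ.

    4.Zero <- 1 if 3.A = 3.B (0 otherwise, assumed for this sketch)
    4.TA   <- 3.NPC + (3.Imm * 4)
    """
    zero4 = 1 if a3 == b3 else 0
    ta4 = npc3 + imm3 * 4
    return zero4, ta4

# BEQ R8, R9, 4 at 400600: 3.NPC holds 400604 and the offset is 4 instructions.
zero4, ta4 = beq_ex(a3=7, b3=7, imm3=4, npc3=0x400604)
print(zero4, hex(ta4))  # prints: 1 0x400614
```

The result 400614 agrees with the comment on the BEQ in the example code (Branch to 400614 if R8 = R9).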
Execute (EX) Cycle for BEQ instructions
We write to PC the clock period after the BEQ is in EX
  We write to PC in the IF stage when it is clock period 4
  The IF stage then changes PC and NPC if 4.Zero is 1
    PC ← If (4.Zero) then 4.TA else if (2.IR.opcode ≠ BEQ) then PC + 4
    2.NPC ← If (4.Zero) then 4.TA else if (2.IR.opcode ≠ BEQ) then PC + 4
  We also need to clear 4.Zero so that a new Branch can be executed
    4.Zero ← If (4.Zero) then 0
[Figure: pipeline occupancy per clock period. Clock period, PC : 1, 400600 ; 2, 400604 ; 3, 400604 ; 4, 400604 ; 5, 400614. The BEQ moves through IF, ID and EX in clock periods 1 to 3 and the AND is fetched in clock period 5]
Execute (EX) Cycle for BEQ Instructions
What shall we do with ADD, SUB and XOR ?
  They should not be fetched until we know the BEQ result !
  If the ID stage has a BEQ we stop the instruction fetch from the memory
  But, we also have to clear 2.IR if it has a BEQ so that we fetch an instruction in the next clock period (clock period 5) : 4.IR has the BEQ in the 4th clock period
    2.IR ← If 4.IR.opcode = BEQ then NOP else if (2.IR.opcode ≠ BEQ) then M[PC]
[Figure: pipeline occupancy per clock period. Clock period, PC : 1, 400600 ; 2, 400604 ; 3, 400604 ; 4, 400604 ; 5, 400614. The BEQ moves through IF, ID and EX in clock periods 1 to 3. In clock period 5 the AND is in IF and a NOP is in ID]
Execute (EX) Cycle for BEQ Instructions
What if we continued with the ADD, SUB and XOR ?
  Would they change any architectural register or memory ? NO !
    Since we arranged the pipeline such that all register writes and memory writes happen at the end of the pipeline
    By the time we know we have a BEQ, we stop them and flush them out
RISC architectures result in late writes that help the hardware designer
CISC architectures often require early writes in the pipeline
  The hardware designer has to undo these early writes when a branch is finally recognized
  Unnecessary pressure on the hardware designer
Execute (EX) Cycle for BEQ Instructions
Stopping the fetches, how does the execution look ?
[Figure: pipeline occupancy per clock period. Clock period, PC : 1, 400600 ; 2, 400604 ; 3, 400604 ; 4, 400604 ; 5, 400614. The BEQ moves through IF, ID and EX in clock periods 1 to 3. In clock period 5 the AND is in IF and a NOP is in ID]
The pipeline is almost empty with only one instruction in the WB stage ! There is only one instruction in the pipeline
This is why Control instructions are important to deal with for pipelines
Execute (EX) Cycle for BEQ instructions
Showing the timing in a different way
[Figure: stage-by-clock-period timing chart for clock periods 1 to 9. The BEQ goes through IF, ID and EX, a pipeline bubble is generated, and the next fetched instruction goes through IF, ID, EX, MEM and WB. The Branch causes a pipeline start-up !]
400600 BEQ R8, R9, 4
400604 ADD R10, R11, R12
400608 SUB R13, R14, R15
40060C XOR R16, R17, R18
400610 SLT R19, R20, R21
400614 AND R22, R23, R24
Execute (EX) Cycle for BEQ Instructions In the 4th clock period we complete the
execution of the BEQ by writing the effective address to PC in IF
The control unit knows we are completing the BEQ instruction and so does not allow an instruction fetch
Let’s rewrite microoperations for the BEQ
IF stage
  2.IR ← If 4.IR.opcode = BEQ then NOP else if (2.IR.opcode ≠ BEQ) then M[PC]
  PC ← If (4.Zero) then 4.TA else if (2.IR.opcode ≠ BEQ) then PC + 4
  2.NPC ← If (4.Zero) then 4.TA else if (2.IR.opcode ≠ BEQ) then PC + 4
  4.Zero ← If (4.Zero) then 0
ID stage
  3.NPC ← 2.NPC
EX stage
  4.IR ← 3.IR
  4.Zero ← If 3.A = 3.B then 1
  4.TA ← 3.NPC + (3.Imm * 4)
The BEQ execution completes in the IF stage in the next clock period
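The PC microoperation of the IF stage can be traced in Python for the taken BEQ at 400600. This is a sketch under assumptions: `next_pc` is a hypothetical helper, PC is taken to hold its value when neither condition applies, and 2.IR is taken to keep holding the BEQ until it is cleared to NOP.

```python
def next_pc(pc, zero4, ta4, ir2_is_beq):
    """PC <- If (4.Zero) then 4.TA else if (2.IR.opcode != BEQ) then PC + 4."""
    if zero4:
        return ta4
    if not ir2_is_beq:
        return pc + 4
    return pc  # neither case applies: PC is assumed to hold its value

# (4.Zero, "2.IR holds a BEQ") as seen by the IF stage in clock periods 1 to 4.
conditions = [(0, False), (0, True), (0, True), (1, True)]

trace = [0x400600]  # PC during clock period 1
for zero4, ir2_is_beq in conditions:
    trace.append(next_pc(trace[-1], zero4, 0x400614, ir2_is_beq))

print([hex(pc) for pc in trace])
```

The resulting values for clock periods 1 to 5 are 400600, 400604, 400604, 400604 and 400614, the same sequence shown with the pipeline occupancy figures.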
BEQ instructions take 4 clock periods to execute
  CPIBranch = 4
  Since the Branch execution is completed in the IF stage
Overall, executing a control instruction first creates a pipeline bubble and then causes a pipeline start-up where only one stage, IF, is busy
  It is therefore critical that the number of control instructions be reduced by having
    Better programming styles
    Better compilers
Cautions for the Pipelined EMY CPU With pipelining and memory hierarchies
hardware has become more sensitive to The number of instructions, NI (due to increased
memory hierarchy delays) The number of control instructions (due to pipeline and
memory hierarchy delays that can occur) Now we see why the pipeline is sensitive to control
instructions The order of instructions (due to pipeline delays that
can occur) Class notes on the remaining versions will show examples
why the pipeline is sensitive to a certain order of instructions
What is Pipelining ?
Before we continue with the evaluation of our design, a comment :
Pipelining is often invisible to the programmer, though current architectures allow some visibility to help improve the pipeline
  For example, knowing the pipeline length and how many clock periods complex microoperations take helps the compiler come up with more efficient code
    This is because a better order of instructions can be obtained
    This is a point made earlier
Haldun Hadimioglu CSE – Spring
2014
EMY CPU Version 1 97CS 2214
Pipelined Execution Timing The execution of the code on the Version 1 EMY
pipeline is shown again below by assuming that the cache memories take one clock period and there is no miss
Assume that our integer-instruction pipeline can execute the XOR, SLT, etc.
It takes 11 clock periods to run the code
  Note that the Branch completes in clock period 11 also !
Instruction                 IF  ID  EX  MEM  WB
400200 LW R8, 0(R9)         1   2   3   4    5
400204 ADD R10, R11, R12    2   3   4   5    6
400208 SUB R13, R14, R15    3   4   5   6    7
40020C XOR R16, R17, R18    4   5   6   7    8
400210 SW R19, 0(R20)       5   6   7   8
400214 OR R21, R22, R23     6   7   8   9    10
400218 SLT R24, R25, R26    7   8   9   10   11
40021C BEQ R27, R28, 5      8   9   10
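The timing table can be regenerated with a few lines of Python. This sketch assumes no stalls and one fetch per clock period; the `schedule` helper and the per-instruction stage counts (5 for LW and A/L, 4 for SW, 3 pipeline stages for BEQ before it completes in IF) are written down here as inputs, not taken from a simulator.

```python
# (mnemonic, number of pipeline stages the instruction occupies)
program = [
    ("LW", 5), ("ADD", 5), ("SUB", 5), ("XOR", 5),
    ("SW", 4), ("OR", 5), ("SLT", 5), ("BEQ", 3),
]

def schedule(program):
    """Return (mnemonic, [clock period of each occupied stage]) with no stalls."""
    rows = []
    for index, (name, stages) in enumerate(program):
        first = index + 1  # instruction i is fetched in clock period i + 1
        rows.append((name, list(range(first, first + stages))))
    return rows

rows = schedule(program)
for name, periods in rows:
    print(f"{name:4}", *periods)

last_period = max(periods[-1] for _, periods in rows)
print("last clock period used by a stage:", last_period)
```

The last stage activity is the WB of the SLT in clock period 11; the BEQ leaves EX in clock period 10 and completes in IF in clock period 11, which gives the 11 clock periods quoted above.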
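The Speed Comparison
The piece of program takes 11 clock periods on the pipelined computer as opposed to 32 clock periods on the unpipelined
$$\text{Speedup}_{overall} = \frac{\text{CPUtime}_{old}}{\text{CPUtime}_{new}} = \frac{32}{11} \approx 2.9$$
$$\text{CPI}_{ave,\ w/o\ pipe} = \frac{\text{\# of clock periods for program}}{NI} = \frac{32}{8} = 4$$
$$\text{CPI}_{ave,\ w/\ pipe} = \frac{\text{\# of clock periods for program}}{NI} = \frac{11}{8} = 1.375$$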
In general any pipeline will work fine if
  Every instruction is independent of every other instruction in the pipeline at any moment
    Otherwise, we have what we call hazards as we will see soon
  The number of control instructions is very small
  The order of instructions is good
    Otherwise, we have what we call hazards as we will see soon
  There is a lot of hardware available
In the ideal case, the unpipelined CPIave ≡ the number of pipeline stages
In the ideal case, NI ≡ # of clock periods for the program on the pipeline
Speedupoverall, ideal = pipeline depth (the number of pipeline stages)
Ideal MIPS
$$\text{MIPS}_{ideal} = \frac{\text{\# of instructions completed per clock period} \times \text{clock frequency}}{10^{6}}$$
If the CPU completes one instruction per clock period
$$\text{MIPS}_{ideal} = \frac{\text{clock frequency}}{10^{6}}$$
We now see why microprocessor companies are eager to increase the clock frequency !
Pipeline Timing
Due to start-ups and hazards, CPIave is not 1
The net effect of start-ups and hazards is that more than one clock period is needed to execute an instruction on average
  The number of additional clock periods is the average number of delay cycles (which we will soon call stalls) per instruction
$$\text{CPI}_{ave} = \text{CPI}_{ave,\ ideal} + \text{pipeline delays (stalls) per instruction}$$
Pipeline Timing
Since the ideal CPIave with pipelining is 1, we obtain the following formula
$$\text{Speedup}_{overall} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
It is clear from the above formula that the speedup is directly proportional to the number of pipeline stages
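The speedup formula is easy to evaluate numerically. The function name and the stall value of 0.25 cycles per instruction below are made up for illustration; only the formula itself comes from the slide.

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """Speedup_overall = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return depth / (1 + stalls_per_instruction)

ideal = pipeline_speedup(5, 0.0)          # no stalls: the speedup equals the depth
with_stalls = pipeline_speedup(5, 0.25)   # hypothetical 0.25 stall cycles per instruction
print(ideal, with_stalls)  # prints: 5.0 4.0
```

Even a quarter of a stall cycle per instruction takes a 5-stage pipeline from a speedup of 5 down to 4, which is why the later designs concentrate on removing stalls.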
Pipeline Timing
Example : Assume that a program with no control instructions is run and the following measurements are made on the MIPS
  Calculate CPIave and CPUtime for both unpipelined and pipelined cases and Speedupoverall, the pipeline efficiency and MIPSideal for the pipelined case
  Assume that clock frequency is 200MHz
  Note that this program is an ideal program since there is no Store instruction !
  NI = # of Loads + # of A/L = 10 + 90 = 100
Instruction   CPIi   # of times executed   Unpipelined time
Loads         5      10                    0.25 μsec
A/L           5      90                    2.25 μsec
$$\text{Clock period} = \frac{1}{\text{Clock frequency}} = \frac{1}{200 \times 10^{6}} = 5 \times 10^{-9}\ \text{s} = 5\ \text{ns}$$
Pipeline Timing
Example continued
For the unpipelined case :
  CPUtimeunpipelined = TimeLoads + TimeA/L = 0.25 + 2.25 = 2.5 μsec
  # of clock periods for Loads = # of times executed x CPIi = 10 x 5 = 50
  # of clock periods for A/L = # of times executed x CPIi = 90 x 5 = 450
  # of clock periods for program = # of clock periods for Loads + # of clock periods for A/L = 50 + 450 = 500
$$\text{CPI}_{ave,\ w/o\ pipe} = \frac{\text{\# of clock periods for program}}{NI} = \frac{500}{100} = 5$$
Pipeline Timing
Example continued
For the pipelined case :
  # of clock periods for program = Start-up time + (NI - 1) = 5 + (100 - 1) = 104
  CPUtimepipelined = # of clock periods for program x clock period = 104 x 5 = 520ns = 0.52 μsec
$$\text{MIPS}_{ideal} = \frac{\text{clock frequency}}{10^{6}} = \frac{200 \times 10^{6}}{10^{6}} = 200$$
$$\text{Speedup}_{overall} = \frac{\text{CPUtime}_{old}}{\text{CPUtime}_{new}} = \frac{2.5}{0.52} = 4.81$$
  Speedupoverall is not 5 because of the startup time
$$\text{Pipeline efficiency} = \frac{\text{Speedup}_{overall}}{\text{Speedup}_{ideal}} = \frac{4.81}{5} = 0.96$$
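The whole example can be recomputed in Python as a cross-check. The variable names are hypothetical; the inputs (200 MHz clock, 10 Loads and 90 A/L instructions with CPIi = 5, a 5-stage pipeline with no stalls) are the ones given in the example.

```python
clock_hz = 200e6
clock_ns = 1e9 / clock_hz                      # clock period in ns
counts = {"Loads": (5, 10), "A/L": (5, 90)}    # (CPIi, # of times executed)
depth = 5                                      # pipeline stages = start-up time

ni = sum(times for _, times in counts.values())
unpipelined_cycles = sum(cpi * times for cpi, times in counts.values())
cpi_unpipelined = unpipelined_cycles / ni
cputime_unpipelined_ns = unpipelined_cycles * clock_ns

pipelined_cycles = depth + (ni - 1)            # start-up time + (NI - 1)
cputime_pipelined_ns = pipelined_cycles * clock_ns

speedup = cputime_unpipelined_ns / cputime_pipelined_ns
efficiency = speedup / depth
mips_ideal = clock_hz / 1e6

print(ni, unpipelined_cycles, cpi_unpipelined, cputime_unpipelined_ns)
print(pipelined_cycles, cputime_pipelined_ns)
print(round(speedup, 2), round(efficiency, 2), mips_ideal)
```

The script reproduces the values on the slides: 500 and 104 clock periods, 2500 ns and 520 ns, a speedup of about 4.81, an efficiency of about 0.96 and an ideal rate of 200 MIPS.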
Improving Initial Version 1 Design
Now, we will make an assessment of pipelining to prepare ourselves for the next set of improvements
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
  The hardware is complicated
    Additional temporary registers (latches) are needed between stages so that later stages can correctly work on an instruction
    Some latches are simple duplication of other registers and some are latches that save the output of a stage.
Improving Initial Version 1 Design
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
  The pressure on the memory is doubled : two memory accesses per clock period can happen
    One for an instruction in the IF stage
    One for data in the MEM stage
    For example, for the program execution on slide 97, the CPU makes two memory accesses in the 4th clock period
    The frequency of simultaneous accesses depends on the number of Loads and Stores
    The number of Loads and Stores depends on the application, programmer and compiler
Improving Initial Version 1 Design
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
  Not all instructions require all the stages
    Some stages are empty, idle, creating a pipeline bubble that cannot be avoided
    RISC instructions require fewer stages, therefore the chance of having many unneeded stages is reduced
    With CISC, the number of stages is larger
Improving Initial Version 1 Design Pipelining increases the speed but there are
difficulties and problems associated with pipelining :
The startup time slows the system Its impact is based on
The number of times it occurs (due to control instructions) The time it takes to fill the pipeline (pipeline depth or latency)
RISC systems perform better here since they have shorter pipelines
Improving Initial Version 1 Design Pipelining increases the speed but there are
difficulties and problems associated with pipelining :
Some instructions have complex microoperations that take longer than one clock period to complete
Overall, it is difficult to have balanced stages
Improving Initial Version 1 Design
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
  The clock period is determined by
    The slowest stage, which is often the stage with the addition and the stages with memory accesses
      The EX stage
      The IF and MEM stages
    The latches that need set up time and propagation delays
    The clock skew problem
  In RISC systems it is easy to distribute the work equally to stages but with CISC it is more difficult
    So, in order not to increase the clock period length in CISC systems, a stage that has a complex microoperation takes more than one clock period
    But, this creates bubbles !
Improving Initial Version 1 Design
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
  Because of what we call hazards, an instruction in the pipeline may not be moved to the next stage but forced to stay in the same stage more than one clock period
    The instruction is stalled
    The stages to the left of the stalled instruction cannot move their instruction to the right, to keep the strict order of execution
      These stages become idle (do not work on a new instruction) but keep the old instructions
    This creates a pipeline bubble : The speed is decreased.
    Note that the startups also decrease the speed since there is a larger bubble in the pipeline
      Control instructions result in startups
      Pipeline “hazards” also create startups if poorly designed
Pipeline Hazards
They are caused by a number of reasons forcing the pipeline to stop the execution of an instruction and the instructions that are behind
  The instructions are stalled
  The hazards generate either bubbles or a start-up of the pipeline.
There are three types of hazards
  Structural
  Data
  Control
Structural Hazards
Structural hazards arise from resource conflicts that can be solved with more resources, i.e. more or faster hardware
Examples of structural hazards are
  Only one memory port in the CPU, which stops the IF stage if a Load/Store is using this single memory port to access data in the MEM stage
  If a L1 cache memory takes two or more clock periods !
  If the GPR set has only one write port and several simultaneous GPR writes are performed, only one GPR write will happen, the others will write one by one
  If a stage performs a complex microoperation taking several clock periods, such as FP arithmetic, and this microoperation is not pipelined, then instructions behind it will stay idle in their stages (these instructions are stalled)
Structural Hazards Due to a structural hazard, one or more
instructions behind the instruction that caused the hazard are delayed, are not allowed to move.
The stages behind the hazard causing instruction become idle : A bubble is generated
The bubble moves one stage per clock period and eventually leaves the pipeline.
Structural Hazards
An example
What if there was only one memory port ?
  If a Load or Store tries to access a data element in the memory in the MEM cycle, then the IF stage is forced to stay idle by the control unit so that the priority is given to the instruction already in the pipeline to complete it as soon as possible
  The instruction that was going to be fetched is stalled
  A bubble is created in the IF stage
  The bubble moves up the pipeline one stage per clock period
  Stalling ends when the Load/Store completes the memory access
  Next slide shows this process
Structural Hazards What if there was only one memory port ?
[Figure: stage-by-clock-period timing chart for clock periods 1 to 11. Each instruction goes through IF, ID, EX, MEM and WB, and two instruction fetches are delayed by a Stall. A bubble is created and moves up the pipeline]
400200 LW R8, 0(R9)
400204 ADD R10, R11, R12
400208 SUB R13, R14, R15
40020C XOR R16, R17, R18
400210 SW R19, 0(R20)
400214 OR R21, R22, R23
400218 SLT R24, R25, R26
40021C BEQ R27, R28, 5
Structural Hazards What if there was only one memory port ?
We will avoid using textbook notation of instruction execution since even for a few instructions, a large space is needed to show the flow of execution
Rather, we will use our own notation shown below
Instruction                 IF  ID  EX  MEM  WB
400200 LW R8, 0(R9)         1   2   3   4    5
400204 ADD R10, R11, R12    2   3   4   5    6
400208 SUB R13, R14, R15    3   4   5   6    7
40020C XOR R16, R17, R18    5   6   7   8    9
400210 SW R19, 0(R20)       6   7   8   9
400214 OR R21, R22, R23     7   8   9   10   11
400218 SLT R24, R25, R26    8   9   10  11   12
40021C BEQ R27, R28, 5      10  11  12
XOR is delayed, stalled, in clock period 4 by the LW accessing the memory for its data
  XOR is fetched in the 5th clock period, not in the 4th clock period
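The effect of a single memory port on the fetch times can be sketched in Python. This is an illustration under assumptions, not the control unit itself: `fetch_times` is a hypothetical helper, a Load/Store is taken to use the port in its MEM stage three clock periods after its IF, and no other kind of stall is modelled.

```python
def fetch_times(program):
    """Return the IF clock period of each instruction when IF and MEM share one port."""
    port_busy = set()  # clock periods in which a Load/Store uses the port in MEM
    fetch = []
    period = 1
    for op in program:
        while period in port_busy:   # the data access has priority, so IF stalls
            period += 1
        fetch.append(period)
        if op in ("LW", "SW"):
            port_busy.add(period + 3)  # MEM comes three stages after IF
        period += 1
    return fetch

program = ["LW", "ADD", "SUB", "XOR", "SW", "OR", "SLT", "BEQ"]
print(fetch_times(program))  # prints: [1, 2, 3, 5, 6, 7, 8, 10]
```

The XOR is fetched in clock period 5 because the LW uses the port in clock period 4, and the BEQ is fetched in clock period 10 because the SW uses the port in clock period 9.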
Structural Hazards
What if there was only one memory port ?
  The control unit stops the IF stage from accessing the memory to fetch the XOR
    The reason is that we want to complete the execution of the LW that is already in the pipeline
    Instructions in the pipeline have higher priority for completion
  The SW instruction will access the memory in the 9th clock period to write data
    There will not be an instruction fetch in the 9th clock period
  Once a stall occurs, a bubble is introduced : not all the stages are busy
    The execution time of the instruction is increased ≡ its CPIi is increased
    CPIave is increased
    CPUtime is increased
Structural Hazards What if there was only one memory port ?
We will not have this structural hazard in our system It is also clear from the Version 1 datapath diagram that
we have two separate memory ports Memory Port 1 for instruction fetches Memory Port 2 for data accesses
Hazards Structural Hazards
Often, to solve structural hazards more or faster hardware is needed
However, the solution of the other two hazards, data and control hazards, requires More hardware and Better compilation techniques
To better order instructions To reduce the number of control instructions
The result is that Pipeline bubbles are eliminated or reduced The number of pipeline start-ups is also reduced
Hazards The overall hardware structure that detects a
hazard and stops (stalls) an instruction or several instructions until the hazard condition does not exist is called pipeline interlock
Note that if an instruction is stalled, the instructions behind it are also stalled as we will see shortly
Thus, it is costly to stall a single instruction in the pipeline
Data Hazards
As mentioned before, all previous program examples had instructions independent of each other
  The instructions did not have any register or memory location in common
  For example, an instruction wrote to R10 and the next instruction did not read R10
    The second instruction did not depend on the first instruction
    There was no data dependency between them
    There are other types of data dependencies as we will see shortly
  If two instructions have a data dependency between them and they are in the pipeline, there can be a data hazard
    Let’s see the definition on the next slide
Data Hazards Data hazards occur between two
instructions which are executed close enough in time and there is writable data shared by them
That is there is a data dependency between two instructions and the correct result will occur only if the execution is confined to the sequential rather than pipelined execution to enforce the right order of access to the shared data
The second instruction cannot be executed in a pipelined fashion
It has to wait, stall ! This is sequential (unpipelined) execution then
Data Hazards
If we change the instruction sequence of the previous code to include dependency, there will be data hazards
  We observe that the ADD writes to R10 and the instructions below the ADD read R10
  The ADD and the remaining instructions are executed close in time
  Can there be data hazards among them ?
400200 LW R8, 0(R9)
400204 ADD R10, R11, R12
400208 SUB R13, R14, R10
40020C XOR R16, R17, R10
400210 SW R19, 0(R10)
400214 OR R21, R22, R10
400218 SLT R24, R25, R10
40021C BEQ R27, R10, 5
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
The data element in R10 is shared by all the instructions below the ADD and they are executed close in time
  An instruction, I1, writes to a register and another instruction, I2, reads the same register (the data element)
  I1 has to write first and then I2 has to read : There is a Read after Write (RAW) dependency
  BUT, if I2 reads before I1 writes then there is a RAW hazard
    Can I2 read before I1 writes ? Yes
    We have to stop I2 if it tries to read R10 before the ADD writes to R10
400200 LW R8, 0(R9)
400204 ADD R10, R11, R12
400208 SUB R13, R14, R10     RAW ?
40020C XOR R16, R17, R10     RAW ?
400210 SW R19, 0(R10)        RAW ?
400214 OR R21, R22, R10      RAW ?
400218 SLT R24, R25, R10     RAW ?
40021C BEQ R27, R10, 5       RAW ?
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
There are data dependencies, but are they all data hazards ?
Will all the instructions below the ADD try to read R10 before the ADD writes ? NO !
Soon we will see that data hazards will happen between the ADD and the SUB, XOR and SW
The SUB, XOR and SW will try to read R10 before the ADD writes to R10
The OR, SLT and BEQ will read R10 after the ADD writes to R10
400200 LW R8, 0(R9)
400204 ADD R10, R11, R12
400208 SUB R13, R14, R10     RAW
40020C XOR R16, R17, R10     RAW
400210 SW R19, 0(R10)        RAW
400214 OR R21, R22, R10
400218 SLT R24, R25, R10
40021C BEQ R27, R10, 5
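The difference between a data dependency and a data hazard can be sketched in a few lines of Python. This is only an illustration, not part of the EMY design: the tuple encoding of the instructions, the function name raw_report and the parameter hazard_window are assumptions made for this example. A window of 3 reflects the timing discussed on these slides (the ADD writes R10 in clock period 6, the new value is visible in period 7, and an unstalled instruction at distance d behind the ADD would be in ID in period 3 + d). The sketch ignores the case where an instruction in between overwrites the register.

```python
# Each instruction: (mnemonic, destination register or None, source registers).
PROGRAM = [
    ("LW",  "R8",  ["R9"]),
    ("ADD", "R10", ["R11", "R12"]),
    ("SUB", "R13", ["R14", "R10"]),
    ("XOR", "R16", ["R17", "R10"]),
    ("SW",  None,  ["R19", "R10"]),
    ("OR",  "R21", ["R22", "R10"]),
    ("SLT", "R24", ["R25", "R10"]),
    ("BEQ", None,  ["R27", "R10"]),
]


def raw_report(program, producer_index, hazard_window=3):
    """List every later reader of the producer's destination register.

    Returns (mnemonic, is_hazard) pairs: every pair is a RAW dependency,
    but only readers within hazard_window instructions are RAW hazards.
    """
    dest = program[producer_index][1]
    report = []
    for i in range(producer_index + 1, len(program)):
        name, _, sources = program[i]
        if dest in sources:
            distance = i - producer_index
            report.append((name, distance <= hazard_window))
    return report


for name, is_hazard in raw_report(PROGRAM, 1):
    print(name, "RAW hazard" if is_hazard else "RAW dependency only")
```

With the ADD as the producer (index 1), the SUB, XOR and SW are reported as hazards and the OR, SLT and BEQ as dependencies only.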
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
The SUB, XOR and SW will try to read R10 before the ADD writes to R10
These data dependencies result in data hazards
This data hazard is one of three types of data hazards
An instruction, I1, writes to a register and another instruction, I2, reads the same register (the same data element)
I1 has to write first and then I2 has to read : Read after Write (RAW)
If I2 reads before I1 writes there is a RAW hazard
We will stall the SUB, XOR and SW when they try to read R10
400200 LW R8, 0(R9)
400204 ADD R10, R11, R12
400208 SUB R13, R14, R10     RAW
40020C XOR R16, R17, R10     RAW
400210 SW R19, 0(R10)        RAW
400214 OR R21, R22, R10
400218 SLT R24, R25, R10
40021C BEQ R27, R10, 5
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R14, R10              IF    ID    Stall Stall Stall EX    MEM   WB
40020C XOR R16, R17, R10                    IF    Stall Stall Stall ID    EX    MEM
400210 SW R19, 0(R10)                             Stall Stall Stall IF    ID    EX
400214 OR R21, R22, R10                           Stall Stall Stall       IF    ID
400218 SLT R24, R25, R10                          Stall Stall Stall             IF
40021C BEQ R27, R10, 5                            Stall Stall Stall
Why do we stall the SUB in the ID stage ?
We stall the SUB for 3 clock periods since it needs R10
This creates a 3-clock-period bubble that moves up the pipeline
The XOR is fetched and idles in the IF stage
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
We stalled the SUB in the ID stage since it reads its operands in ID
The SUB reads its operands R14 and R10 in the ID stage
This is clock period 4
When will the ADD write to R10 ? In clock period 6 !
When will R10 actually get the new value ? In the beginning of the 7th clock period !
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R14, R10              IF    ID    Stall Stall Stall EX    MEM   WB
40020C XOR R16, R17, R10                    IF    Stall Stall Stall ID    EX    MEM
400210 SW R19, 0(R10)                             Stall Stall Stall IF    ID    EX
400214 OR R21, R22, R10                           Stall Stall Stall       IF    ID
400218 SLT R24, R25, R10                          Stall Stall Stall             IF
40021C BEQ R27, R10, 5                            Stall Stall Stall
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
Why does R10 get its new value in the beginning of the 7th clock period ?
According to the state diagram of Version 1, the ADD writes from 5.ALUout to its destination register in the WB stage
This is clock period 6
Why does R10 get the value in the beginning of the 7th clock period ?
As we discussed before, we clock (store on) our registers at the end of a clock period and therefore, registers change their values in the beginning of the next clock period
              Clock period 6      Clock period 7
5.ALUout      Result of ADD       ?
R10           ?                   Result of ADD
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
In summary, the SUB is stalled in the ID stage for three clock periods
A 3-clock-period-long bubble is created and moves up the pipeline
If we show the pipeline in our notation
Instruction               IF    ID    EX    MEM   WB
400200 LW R8, 0(R9)       1     2     3     4     5
400204 ADD R10, R11, R12  2     3     4     5     6
400208 SUB R13, R14, R10  3     4/7   8     9     10
40020C XOR R16, R17, R10  4/7   8     9     10    11
400210 SW R19, 0(R10)     8     9     10    11
400214 OR R21, R22, R10   9     10    11    12    13
400218 SLT R24, R25, R10  10    11    12    13    14
40021C BEQ R27, R10, 5    11    12    13
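The table in our notation can be reproduced with a small Python model of the interlock. This is a sketch under stated assumptions, not the EMY control unit: the function name schedule, the parameter gap and the tuple encoding of instructions are made up for this example. It assumes static issuing, no forwarding, and one clock period per stage. For simplicity it gives the BEQ a MEM period as well, which the table on the slide does not show.

```python
# Each instruction: (mnemonic, destination register or None, source registers).
PROGRAM = [
    ("LW",  "R8",  ["R9"]),
    ("ADD", "R10", ["R11", "R12"]),
    ("SUB", "R13", ["R14", "R10"]),
    ("XOR", "R16", ["R17", "R10"]),
    ("SW",  None,  ["R19", "R10"]),
    ("OR",  "R21", ["R22", "R10"]),
    ("SLT", "R24", ["R25", "R10"]),
    ("BEQ", None,  ["R27", "R10"]),
]


def schedule(program, gap=1):
    """Return, per instruction, the clock periods spent in each stage.

    gap=1: a GPR written in WB can be read in ID one clock period later.
    gap=0: write in the first half, read in the second half of the same period.
    IF and ID are (first, last) period pairs, since an instruction can wait there.
    """
    rows = []
    wb_of_writer = {}  # register -> WB clock period of its most recent writer
    for name, dest, sources in program:
        if rows:
            prev = rows[-1]
            if_first = prev["ID"][0]      # IF is free once the previous instruction enters ID
            id_first = prev["ID"][1] + 1  # ID is free once the previous instruction leaves ID
        else:
            if_first, id_first = 1, 2
        # Earliest period in which every source register can be read in ID.
        ready = max((wb_of_writer[r] + gap for r in sources if r in wb_of_writer),
                    default=0)
        id_last = max(id_first, ready)
        row = {
            "name": name,
            "IF": (if_first, id_first - 1),
            "ID": (id_first, id_last),
            "EX": id_last + 1,
            "MEM": id_last + 2,
            "WB": id_last + 3 if dest is not None else None,
        }
        if dest is not None:
            wb_of_writer[dest] = row["WB"]
        rows.append(row)
    return rows


for row in schedule(PROGRAM):
    print(row)
```

With gap=1 the SUB is in ID for clock periods 4 to 7 and the XOR is in IF for clock periods 4 to 7, as in the table above. Calling schedule(PROGRAM, gap=0) models writing in the first half and reading in the second half of the same clock period.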
Pipeline Interlocks
What we are doing is checking for hazard situations in the ID stage, and when we recognize a hazard, we stall the instruction in the ID stage !
If an instruction does not have a hazard situation, it is allowed to proceed to the EX stage
That is, the instruction is issued to the EX stage
If the instruction has a hazard, it is stalled in the ID stage by the pipeline interlock to preserve the execution pattern
Pipeline Interlocks
If an instruction is stalled in the ID stage, then the instruction in the IF stage is stalled as well
That is, the instruction behind the stalled instruction is not allowed to pass by and continue with its execution
This is called static issuing
Static issuing reduces hardware since we do not have to keep track of which instruction changed which part of the state
This is because a stalled instruction still updates the state before all the instructions that follow it
Pipeline Interlocks
If dynamic issuing is allowed, then an instruction in the IF stage would pass by the stalled instruction in the ID stage and start its EX cycle
However, dynamic issuing allows other data hazards, WAR and WAW, to happen, as we will discuss later
We need hardware that does not allow an instruction behind a stalled instruction to update the state
Can we somehow allow this instruction to proceed ? Yes, we can allow it to generate its results
But, we have to buffer the results and write them to the destination after the stalled instruction is finished, for a correct execution pattern
We then need additional hardware to keep temporary results and keep track of the instructions’ progress
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
What if the SUB does not have a RAW hazard but the XOR has ?
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R14, R15              IF    ID    EX    MEM   WB
40020C XOR R16, R17, R10                    IF    ID    Stall Stall EX    MEM   WB
400210 SW R19, 0(R10)                             IF    Stall Stall ID    EX    MEM
400214 OR R21, R22, R10                                 Stall Stall IF    ID    EX
400218 SLT R24, R25, R10                                Stall Stall       IF    ID
40021C BEQ R27, R10, 5                                  Stall Stall             IF
We stall the XOR for 2 clock periods and create a 2-clock-period bubble that moves up the pipeline
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
What if the SUB does not have a RAW hazard but the XOR has ?
The XOR is in ID in the 5th clock period but has to wait until the 7th clock period
If we show the pipeline in our notation
Instruction               IF    ID    EX    MEM   WB
400200 LW R8, 0(R9)       1     2     3     4     5
400204 ADD R10, R11, R12  2     3     4     5     6
400208 SUB R13, R14, R15  3     4     5     6     7
40020C XOR R16, R17, R10  4     5/7   8     9     10
400210 SW R19, 0(R10)     5/7   8     9     10
400214 OR R21, R22, R10   8     9     10    11    12
400218 SLT R24, R25, R10  9     10    11    12    13
40021C BEQ R27, R10, 5    10    11    12
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
What if the SUB and XOR do not have a RAW hazard but the SW has ?
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R14, R15              IF    ID    EX    MEM   WB
40020C XOR R16, R17, R18                    IF    ID    EX    MEM   WB
400210 SW R19, 0(R10)                             IF    ID    Stall EX    MEM
400214 OR R21, R22, R10                                 IF    Stall ID    EX    MEM
400218 SLT R24, R25, R10                                      Stall IF    ID    EX
40021C BEQ R27, R10, 5                                        Stall       IF    ID
We stall the SW for 1 clock period and create a 1-clock-period bubble that moves up the pipeline
Data Hazards
Let’s concentrate on the ADD and the instructions that follow it
What if the SUB and XOR do not have a RAW hazard but the SW has ?
If we show the pipeline in our notation
Instruction               IF    ID    EX    MEM   WB
400200 LW R8, 0(R9)       1     2     3     4     5
400204 ADD R10, R11, R12  2     3     4     5     6
400208 SUB R13, R14, R15  3     4     5     6     7
40020C XOR R16, R17, R18  4     5     6     7     8
400210 SW R19, 0(R10)     5     6/7   8     9
400214 OR R21, R22, R10   6/7   8     9     10    11
400218 SLT R24, R25, R10  8     9     10    11    12
40021C BEQ R27, R10, 5    9     10    11
Eliminating Hazards
We will eliminate delays due to RAW hazards
We will write to GPR registers in the WB stage in the first half of the clock period and read GPR registers in the ID stage in the second half of the same clock period
We will add new hardware to eliminate other RAW delays
We will reduce the amount of delay due to control hazards
By assuming a certain compiler functionality we will eliminate the control hazard delays completely
However, this compiler functionality is not acceptable in real life
It does not allow software compatibility, as we will see later
Data Hazards
Writing to a GPR in the first half – reading the same GPR in the second half of the same clock period
Consider the timing diagram of writing to R10 in the 6th clock period again
What if we clock (store on) R10 in the middle of the 6th clock period, where there is a negative edge !?
That is, what if we write not at the end of the 6th clock period, but in the middle ?
This is possible by using negative-edge-triggered GPR registers
So, we write from 5.ALUout to R10 in the middle of the clock period !
Data Hazards
Writing to a GPR in the first half – reading the same GPR in the second half of the same clock period
OK, we write in the first half; can we read the same register in the second half ?
Yes, reading means getting the value from R10 in the second half and storing it on the destination register at the end of the same clock period, when there is a positive edge
We read from GPR registers and store on temporary registers 3.A and 3.B in the ID stage
In this specific example R10 is stored on 3.B for the SUB instruction
This will save one clock period for us
From now on the GPR registers are clocked at negative edges and the other registers are clocked at positive edges
Data Hazards
Writing to a GPR in the first half – reading the same GPR in the second half of the same clock period
Let’s visualize what happens in clock periods 5, 6 and 7
              Clock period 5      Clock period 6                    Clock period 7
5.ALUout      ?                   Result of ADD                     ?
R10           ?                   Result of ADD (from the middle)   Result of ADD
3.B           ?                   ?                                 Result of ADD
In the 6th clock period R10 has its new value, which is transferred to 3.B
Therefore, the SUB can be in EX in the 7th clock period to use 3.B
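The effect of clocking the GPR registers in the middle of the clock period can be mimicked with a tiny Python model. This is a sketch only: the function name gpr_clock_period, the flag split_cycle and the value 42 standing in for the result of the ADD are assumptions made for this example, and the model works on whole clock periods rather than on real clock edges.

```python
def gpr_clock_period(regs, write=None, read=None, split_cycle=True):
    """Model one clock period of the GPR file.

    write: (register, value) written by the instruction in WB, or None.
    read:  register read by the instruction in ID, or None.
    split_cycle=True:  the write lands in the middle of the period, so a read
                       in the second half sees the new value.
    split_cycle=False: the write lands at the end of the period, so a read
                       in the same period still sees the old value.
    Returns the value that ID stores on 3.A or 3.B at the end of the period.
    """
    before = dict(regs)          # GPR contents at the start of the period
    if write is not None:
        reg, value = write
        regs[reg] = value        # in both models the new value is in place for the next period
    visible = regs if split_cycle else before
    return visible.get(read)


# Clock period 6: the ADD writes R10 in WB while the SUB reads R10 in ID.
regs = {"R10": 1}                # 1 stands for the old value of R10
print(gpr_clock_period(regs, write=("R10", 42), read="R10"))

regs = {"R10": 1}
print(gpr_clock_period(regs, write=("R10", 42), read="R10", split_cycle=False))
```

The first call returns the new value, so the SUB can go to EX in the 7th clock period. The second call returns the old value, which is why the SUB had to wait one more clock period before.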
Data Hazards
Writing to a GPR in the first half – reading the same GPR in the second half of the same clock period
Let’s see the new execution flow
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R14, R10              IF    ID    Stall Stall EX    MEM   WB
40020C XOR R16, R17, R10                    IF    Stall Stall ID    EX    MEM   WB
400210 SW R19, 0(R10)                             Stall Stall IF    ID    EX    MEM
400214 OR R21, R22, R10                           Stall Stall       IF    ID    EX
400218 SLT R24, R25, R10                          Stall Stall             IF    ID
40021C BEQ R27, R10, 5                            Stall Stall                   IF
We stall the SUB for 2 clock periods and create a 2-clock-period bubble that moves up the pipeline
We will draw short lines in the WB and ID stages to indicate that the RAW hazard has been resolved by the write-in-first-half-read-in-the-second-half feature
Data Hazards
Writing to a GPR in the first half – reading the same GPR in the second half of the same clock period
If we show the pipeline in our notation
Instruction               IF    ID    EX    MEM   WB
400200 LW R8, 0(R9)       1     2     3     4     5
400204 ADD R10, R11, R12  2     3     4     5     6
400208 SUB R13, R14, R10  3     4/6   7     8     9
40020C XOR R16, R17, R10  4/6   7     8     9     10
400210 SW R19, 0(R10)     7     8     9     10
400214 OR R21, R22, R10   8     9     10    11    12
400218 SLT R24, R25, R10  9     10    11    12    13
40021C BEQ R27, R10, 5    10    11    12
We will draw short lines in the WB and ID stages to indicate that the RAW hazard has been resolved by the write-in-first-half-read-in-the-second-half feature
Data Hazards
Writing to a GPR in the first half – reading the same GPR in the second half of the same clock period
Will this help if the SUB does not have a RAW hazard but the XOR has ? YES !
Instruction               IF    ID    EX    MEM   WB
400200 LW R8, 0(R9)       1     2     3     4     5
400204 ADD R10, R11, R12  2     3     4     5     6
400208 SUB R13, R14, R15  3     4     5     6     7
40020C XOR R16, R17, R10  4     5/6   7     8     9
400210 SW R19, 0(R10)     5/6   7     8     9
400214 OR R21, R22, R10   7     8     9     10    11
400218 SLT R24, R25, R10  8     9     10    11    12
40021C BEQ R27, R10, 5    9     10    11
We saved one clock period !
Note that the GPR registers are always written in the middle of the clock period !
We show the short lines when this feature helps a RAW hazard !
Data Hazards
Writing to a GPR in the first half – reading the same GPR in the second half of the same clock period
Will this help if the SUB and XOR do not have a RAW hazard but the SW has ? YES !
Instruction               IF    ID    EX    MEM   WB
400200 LW R8, 0(R9)       1     2     3     4     5
400204 ADD R10, R11, R12  2     3     4     5     6
400208 SUB R13, R14, R15  3     4     5     6     7
40020C XOR R16, R17, R18  4     5     6     7     8
400210 SW R19, 0(R10)     5     6     7     8
400214 OR R21, R22, R10   6     7     8     9     10
400218 SLT R24, R25, R10  7     8     9     10    11
40021C BEQ R27, R10, 5    8     9     10
Data Hazards
How will we eliminate the remaining two stall cycles ?
We will use forwarding, also known as bypassing, to do that
This means we have additional hardware to eliminate the stalls
The additional hardware will be new wires and new MUXes, and MUX3 of the datapath will be larger
To visualize how we can do this, let’s look at the Version 1 state diagram and the datapath for the ADD instruction
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R14, R10              IF    ID    Stall Stall EX    MEM   WB
40020C XOR R16, R17, R10                    IF    Stall Stall ID    EX    MEM   WB
400210 SW R19, 0(R10)                             Stall Stall IF    ID    EX    MEM
400214 OR R21, R22, R10                           Stall Stall       IF    ID    EX
400218 SLT R24, R25, R10                          Stall Stall             IF    ID
40021C BEQ R27, R10, 5                            Stall Stall                   IF
The new value of R10 is calculated in the EX stage in the 4th clock period for the ADD
Data Hazards
Forwarding (Bypassing)
The new value of R10 is stored on 4.ALUout at the end of the 4th clock period
The new value of R10 is available for use in the MEM stage in the beginning of the 5th clock period
Why don’t we forward the new value in 4.ALUout directly from the MEM stage to the EX stage in the 5th clock period ?
At the same time, why don’t we allow the SUB to read the old value of R10 into 3.B in the ID stage, so that we do not stall it in the 4th clock period ?
Then, when the SUB enters EX in the 5th clock period, it uses the forwarded value from 4.ALUout and bypasses the value of 3.B
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R14, R10              IF    ID    EX    MEM   WB
40020C XOR R16, R17, R10                    IF    ID    Stall EX    MEM   WB
400210 SW R19, 0(R10)                             IF    Stall ID    EX    MEM
400214 OR R21, R22, R10                                 Stall IF    ID    EX    MEM
400218 SLT R24, R25, R10                                Stall       IF    ID    EX
40021C BEQ R27, R10, 5                                  Stall             IF    ID
The arrow from MEM to EX indicates forwarding
Data Hazards
Forwarding (Bypassing)
Instead of waiting for the new value of R10 to go (i) from the ALU to 4.ALUout, then (ii) to 5.ALUout, then (iii) to R10 and then finally (iv) to 3.B, we forward the new value of R10 directly to the EX stage, to the input of the ALU, bypassing the value in 3.B that has the old R10 value
MUX3 is larger now
[Figure: datapath of the ID, EX and MEM stages showing 3.A, 3.B, 3.Imm, MUX3, the ADD unit and 4.ALUout, with 4.ALUout forwarded to MUX3]
Data Hazards
Forwarding (Bypassing)
If we show the pipeline in our notation
Instruction               IF    ID    EX    MEM   WB
400200 LW R8, 0(R9)       1     2     3     4     5
400204 ADD R10, R11, R12  2     3     4     5     6
400208 SUB R13, R14, R10  3     4     5     6     7
40020C XOR R16, R17, R10  4     5/6   7     8     9
400210 SW R19, 0(R10)     5/6   7     8     9
400214 OR R21, R22, R10   7     8     9     10    11
400218 SLT R24, R25, R10  8     9     10    11    12
40021C BEQ R27, R10, 5    9     10    11
The arrow from MEM to EX indicates forwarding
Data Hazards
Forwarding (Bypassing)
What can we do to eliminate the stall for the XOR ?
To eliminate the stall for the XOR we will employ forwarding from the WB stage to the EX stage (as you will see on the next slide) !
This is because, if we allow the XOR to read the old value of R10 in clock period 5, it can get the new value of R10 in the beginning of the 6th clock period
In the 6th clock period, the new value of R10 is with the ADD in the WB stage on register 5.ALUout
We then forward the value from 5.ALUout to MUX3, bypassing 3.B
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R14, R10              IF    ID    EX    MEM   WB
40020C XOR R16, R17, R10                    IF    ID    Stall EX    MEM   WB
400210 SW R19, 0(R10)                             IF    Stall ID    EX    MEM
400214 OR R21, R22, R10                                 Stall IF    ID    EX    MEM
400218 SLT R24, R25, R10                                Stall       IF    ID    EX
40021C BEQ R27, R10, 5                                  Stall             IF    ID
Data Hazards
Forwarding (Bypassing)
Now, there is no stall !
Note the short lines in clock period 6 that indicate that write-in-first-half-read-in-the-second-half helps eliminate the stall between the ADD and the SW
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R14, R10              IF    ID    EX    MEM   WB
40020C XOR R16, R17, R10                    IF    ID    EX    MEM   WB
400210 SW R19, 0(R10)                             IF    ID    EX    MEM
400214 OR R21, R22, R10                                 IF    ID    EX    MEM   WB
400218 SLT R24, R25, R10                                      IF    ID    EX    MEM
40021C BEQ R27, R10, 5                                              IF    ID    EX
Data Hazards
Forwarding (Bypassing)
If we show the pipeline in our notation
There is no stall !
Note the short lines in clock period 6 that indicate that write-in-first-half-read-in-the-second-half helps eliminate the stall between the ADD and the SW
Instruction               IF    ID    EX    MEM   WB
400200 LW R8, 0(R9)       1     2     3     4     5
400204 ADD R10, R11, R12  2     3     4     5     6
400208 SUB R13, R14, R10  3     4     5     6     7
40020C XOR R16, R17, R10  4     5     6     7     8
400210 SW R19, 0(R10)     5     6     7     8
400214 OR R21, R22, R10   6     7     8     9     10
400218 SLT R24, R25, R10  7     8     9     10    11
40021C BEQ R27, R10, 5    8     9     10
Data Hazards
Forwarding (Bypassing)
Till now we considered this code, where R10 is the second operand register, i.e. register Rt in the R-format, for the SUB, XOR and SW
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R14, R10              IF    ID    EX    MEM   WB
40020C XOR R16, R17, R10                    IF    ID    EX    MEM   WB
400210 SW R19, 0(R10)                             IF    ID    EX    MEM
400214 OR R21, R22, R10                                 IF    ID    EX    MEM   WB
400218 SLT R24, R25, R10                                      IF    ID    EX    MEM
40021C BEQ R27, R10, 5                                              IF    ID    EX
What if R10 is the first operand register, register Rs, in the R-format ?
Data Hazards
Forwarding (Bypassing)
What if R10 is Rs for the SUB and XOR ?
In this case we forward from 4.ALUout and 5.ALUout to a new MUX, MUX2, bypassing 3.A
Only the SUB and XOR instructions will have the RAW hazard, and the stall cycles will be eliminated by forwarding to MUX2
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R12        IF    ID    EX    MEM   WB
400208 SUB R13, R10, R15              IF    ID    EX    MEM   WB
40020C XOR R16, R10, R18                    IF    ID    EX    MEM   WB
400210 SW R19, 0(R20)                             IF    ID    EX    MEM
400214 OR R10, R22, R23                                 IF    ID    EX    MEM   WB
400218 SLT R10, R25, R26                                      IF    ID    EX    MEM
40021C BEQ R10, R28, 5                                              IF    ID    EX
Data Hazards
Forwarding (Bypassing)
What if R10 is Rs for the SUB and XOR ?
If we show the pipeline in our notation
Instruction               IF    ID    EX    MEM   WB
400200 LW R8, 0(R9)       1     2     3     4     5
400204 ADD R10, R11, R12  2     3     4     5     6
400208 SUB R13, R10, R15  3     4     5     6     7
40020C XOR R16, R10, R18  4     5     6     7     8
400210 SW R19, 0(R20)     5     6     7     8
400214 OR R10, R22, R23   6     7     8     9     10
400218 SLT R10, R25, R26  7     8     9     10    11
40021C BEQ R10, R28, 5    8     9     10
Data Hazards
EMY forwarding (bypassing) for the general case
By using forwarding (bypassing), results that have not yet reached the destination GPR can be forwarded to the inputs of
  Functional units in the ALU in EX
  Memory port 2 in MEM
bypassing the inputs that are shown in the Version 1 state diagram and datapath
Remember that we forward a value when it is needed
One exception is the Store instruction since it completes not in 5 but in 4 clock periods (as we will see soon) !
Also, soon we will see that the BEQ will complete in 2 clock periods and we will forward to ID
Data Hazards
What forwarding does is let the functional units in the ALU and memory port 2 bypass the GPR registers
If they cannot get the new value of a GPR register on time, the new value is forwarded from
  4.ALUout
  5.ALUout
  5.MDR
to the inputs of
  Functional units in the ALU
  Memory port 2
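The choice made by MUX2 or MUX3 for one ALU operand can be sketched as a small Python function. This is an illustration under assumptions, not the EMY control logic: the function name select_alu_input and its parameters are made up for this example. It assumes that the instruction in MEM is an ALU instruction (a Load in MEM does not have its data yet), and that the most recent writer wins when both the MEM and WB stages hold a value for the same register, a priority rule the slides do not state.

```python
def select_alu_input(src_reg, mem_dest, wb_dest, wb_is_load, default):
    """Choose where one ALU operand of the instruction in EX comes from.

    src_reg:    the register the instruction in EX needs for this operand
    mem_dest:   destination register of the instruction in MEM, or None
    wb_dest:    destination register of the instruction in WB, or None
    wb_is_load: True if the instruction in WB is a Load (its value is in 5.MDR)
    default:    the normal source, "3.A" (via MUX2) or "3.B" (via MUX3)
    """
    if mem_dest is not None and src_reg == mem_dest:
        return "4.ALUout"                       # forward from MEM to EX
    if wb_dest is not None and src_reg == wb_dest:
        return "5.MDR" if wb_is_load else "5.ALUout"   # forward from WB to EX
    return default                              # no hazard: use the value read in ID


# Clock period 5: the SUB is in EX, the ADD (R10) is in MEM, the LW (R8) is in WB.
print(select_alu_input("R10", mem_dest="R10", wb_dest="R8", wb_is_load=True, default="3.B"))
# Clock period 6: the XOR is in EX, the SUB (R13) is in MEM, the ADD (R10) is in WB.
print(select_alu_input("R10", mem_dest="R13", wb_dest="R10", wb_is_load=False, default="3.B"))
```

The first call selects 4.ALUout and the second selects 5.ALUout, which are the two forwarding paths used for the SUB and the XOR.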
Data Hazards
Forwarding (Bypassing)
We show the changes to the inputs of the ALU below
[Figure: EX, MEM and WB stages of the datapath. MUX2 and MUX3 feed the ALU; the labeled registers are 3.A, 3.B, 3.Imm, 4.ALUout, 5.ALUout and 5.MDR]
Data Hazards
Forwarding (Bypassing)
We show the changes to the inputs of Memory Port 2 below
[Figure: MEM and WB stages of the datapath. Labels: 4.ALUout, 4.B, MUX5, Memory Port 2 with AB2, DB2 and DB3, 5.ALUout and 5.MDR]
Data Hazards
Forwarding (Bypassing)
We show the exception case for Store instructions, where the value to be written to a memory location has to be passed to a Store in the EX stage even though it is needed not in EX but in MEM
We have to have a new MUX in EX that will move data to 4.B either from 3.B or from 5.ALUout or 5.MDR
[Figure: EX, MEM and WB stages of the datapath. Labels: 3.NPC, 3.A, 3.B, 3.Imm, MUX6, 4.ALUout, 4.B, 5.ALUout and 5.MDR]
Data Hazards
Forwarding (Bypassing)
In summary, we have the following changes to the EMY datapath for forwarding purposes
  Three new multiplexers, MUX2, MUX5 and MUX6
  MUX3 is larger
There will be additional forwarding hardware for the BEQ
Data Hazards
As we said before, there are three types of data hazards
Read after write, RAW
Instruction 1 has to write and then Instruction 2 has to read : I1W - I2R
We studied it on previous slides
We need to prevent I2R - I1W
So, we stall I2 unless we can forward the value
We can use forwarding and write-in-the-first-half-read-in-the-second-half to avoid the stall in all cases except one that involves Load instructions, as described below
Data Hazards
There are three types of data hazards
Write after read, WAR
Instruction 1 has to read and then Instruction 2 has to write : I1R - I2W
We need to prevent I2W - I1R
So, we need to stall I2
This hazard cannot occur on EMY since all reads are early and all writes are late
It will happen when some instructions write early and some others read late
An example is an instruction that uses the autoincrement addressing mode : ADD R8, (R9)+
This instruction does the following : R8 ← R8 + M[R9], then R9 ← R9 + 4
Often the CPU writes the new value of R9 in the MEM stage, not in the WB stage, provided that there is a separate integer adder
So, we write to R9 early, perhaps before a previous instruction can read it
This instruction is a typical CISC instruction
The example shows how the architecture complexity affects the hardware design, in this case pipelining !
This hazard can always be prevented by changing the destination register of the second instruction !
Data Hazards
There are three types of data hazards
Write after write, WAW
Instruction 1 has to write and then Instruction 2 has to write : I1W - I2W
We need to prevent I2W - I1W
So, we need to stall I2 to prevent a wrong value on the destination
This hazard cannot occur on EMY since all reads are early and all writes are late (only the WB stage writes to a GPR)
It will happen if more than one stage can write
Allowing writes in different stages can result in two writes to a GPR in the same clock period
The previous example can cause a WAW hazard : ADD R8, (R9)+
This instruction does the following : R8 ← R8 + M[R9], then R9 ← R9 + 4
The CPU writes the new value of R9 in the MEM stage, not in the WB stage
So, we write to R9 early, perhaps when a previous instruction is also writing to R9 at the same time
This instruction is a typical CISC instruction
The example shows how the architecture complexity affects the hardware design, in this case pipelining !
This hazard can always be prevented by changing the destination register of the second instruction !
Data Hazards
There are three types of data hazards
WAR and WAW
The WAR and WAW hazards will also happen when an instruction is allowed to proceed even though the instruction in front of it is stalled
For example, with dynamic issuing, an instruction passes by a stalled instruction and so it writes too soon !
This is a topic to deal with in Computer Architecture II !
The fourth hazard ? Read after Read, RAR
This is not a hazard since no value is changed by the two readings
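The three dependency types can be told apart by comparing the destination and source registers of two instructions. The Python sketch below does only that. The function name dependencies and the (destination, sources) encoding are assumptions made for this example, and the sketch classifies dependencies only: whether a dependency becomes a hazard depends on the pipeline, as discussed on these slides.

```python
def dependencies(first, second):
    """Classify the data dependencies from an earlier to a later instruction.

    Each instruction is (destination register or None, list of source registers).
    Returns a set containing any of "RAW", "WAR" and "WAW".
    """
    dest1, sources1 = first
    dest2, sources2 = second
    kinds = set()
    if dest1 is not None and dest1 in sources2:
        kinds.add("RAW")   # I1 writes, then I2 reads
    if dest2 is not None and dest2 in sources1:
        kinds.add("WAR")   # I1 reads, then I2 writes
    if dest1 is not None and dest1 == dest2:
        kinds.add("WAW")   # I1 writes, then I2 writes
    return kinds


add = ("R10", ["R11", "R12"])      # ADD R10, R11, R12
sub = ("R13", ["R14", "R10"])      # SUB R13, R14, R10
or_ = ("R10", ["R22", "R23"])      # OR  R10, R22, R23
xor = ("R16", ["R17", "R10"])      # XOR R16, R17, R10

print(dependencies(add, sub))      # the SUB reads what the ADD writes
print(dependencies(sub, or_))      # the OR writes what the SUB reads
print(dependencies(add, or_))      # both write R10
print(dependencies(sub, xor))      # both only read R10 (RAR): no dependency
```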
Data Hazards
Not all RAW hazard stalls can be eliminated via forwarding and write-in-the-first-half-read-in-the-second-half
Let’s consider our piece of mnemonic machine language program again, where there is now a dependency between the LW and the instructions that follow it
We observe that the LW writes to R8 and the instructions below the LW read R8
The LW and the remaining instructions are executed close in time
Can there be data hazards among them ?
400200 LW R8, 0(R9)
400204 ADD R10, R11, R8
400208 SUB R13, R14, R8
40020C XOR R16, R17, R8
400210 SW R19, 0(R8)
400214 OR R21, R22, R8
400218 SLT R24, R25, R8
40021C BEQ R27, R8, 5
Data Hazards
The data element in R8 is shared by all the instructions below the LW and they are executed close in time
Yes, there are data dependencies, but are they all data hazards ?
Will all the instructions below the LW try to read R8 before the LW writes ?
Data hazards will happen between the LW and the ADD, SUB and XOR
The ADD, SUB and XOR will try to read R8 before the LW writes to R8
This data hazard is the RAW hazard
We might have to stall the ADD, SUB and XOR when they try to read R8 ????
The SW, OR, SLT and BEQ will read R8 after the LW writes to R8
They do not have any hazard situation !!!
400200 LW R8, 0(R9)
400204 ADD R10, R11, R8      RAW ?
400208 SUB R13, R14, R8      RAW ?
40020C XOR R16, R17, R8      RAW ?
400210 SW R19, 0(R8)         RAW ?
400214 OR R21, R22, R8       RAW ?
400218 SLT R24, R25, R8      RAW ?
40021C BEQ R27, R8, 5        RAW ?
Data Hazards
Do we have to stall the ADD, SUB and XOR when they try to read R8 ?
If yes, can we eliminate any possible stall by using forwarding ?
Yes, we can eliminate the data hazard stalls between the LW and the SUB and XOR !
But, we cannot eliminate a stall cycle between the LW and the ADD with forwarding and write-in-the-first-half-read-in-the-second-half
400200 LW R8, 0(R9)
400204 ADD R10, R11, R8      RAW
400208 SUB R13, R14, R8      RAW
40020C XOR R16, R17, R8      RAW
400210 SW R19, 0(R8)
400214 OR R21, R22, R8
400218 SLT R24, R25, R8
40021C BEQ R27, R8, 5
Data Hazards
Why is it that we cannot eliminate the stall cycle between the LW and the ADD ?
According to our state diagram, the LW reads the data from the memory in the MEM stage
This is clock period 4
The data will come from the memory at the end of the 4th clock period since the memory takes one clock period to access
But, the ADD needs that data from the memory in the beginning of the 4th clock period
We need to stall the ADD and forward the data from WB to EX in the 5th clock period
Clock period              1     2     3     4     5     6     7
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R8         IF    ID    Stall EX    MEM   WB
Data Hazards
Why is it that we cannot eliminate the stall cycle between the LW and the ADD ?
Clock period              1     2     3     4     5     6     7     8     9     10
400200 LW R8, 0(R9)       IF    ID    EX    MEM   WB
400204 ADD R10, R11, R8         IF    ID    Stall EX    MEM   WB
400208 SUB R13, R14, R8               IF    Stall ID    EX    MEM   WB
40020C XOR R16, R17, R8                           IF    ID    EX    MEM   WB
400210 SW R19, 0(R8)                                    IF    ID    EX    MEM
400214 OR R21, R22, R8                                        IF    ID    EX    MEM
400218 SLT R24, R25, R8                                             IF    ID    EX
40021C BEQ R27, R8, 5                                                     IF    ID
Data Hazards
Why is it that we cannot eliminate the stall cycle between the LW and the ADD ?
We see that the ADD is stalled to wait for the LW to read the memory
Where is the ADD stalled ? In the ID stage ? YES
Data Hazards
Why is it that we cannot eliminate the stall cycle between the LW and the ADD ?
As mentioned before, we check for hazard situations in the ID stage, and when we recognize a hazard, we stall the instruction in the ID stage !
We have static issuing
We stall the ADD due to its RAW hazard
We stall the SUB, XOR and the others behind the ADD for a correct execution pattern
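The Load interlock check made in the ID stage can be sketched in Python as follows. The function name load_interlock and the tuple encoding are assumptions made for this example, and the sketch covers only the one remaining stall discussed here: an instruction in ID that needs the register being loaded by the LW that is in EX at the same time.

```python
def load_interlock(instr_in_id, instr_in_ex):
    """Return True if the instruction in ID must be stalled for one clock period.

    Each instruction is (mnemonic, destination register or None, source registers).
    The stall is needed when the instruction in EX is a Load whose destination
    is a source of the instruction in ID: the loaded data only arrives at the
    end of the MEM stage, too late to be forwarded to EX in the next period.
    """
    name_ex, dest_ex, _ = instr_in_ex
    _, _, sources_id = instr_in_id
    return name_ex == "LW" and dest_ex in sources_id


lw = ("LW", "R8", ["R9"])
add_dependent = ("ADD", "R10", ["R11", "R8"])
add_independent = ("ADD", "R10", ["R11", "R12"])

print(load_interlock(add_dependent, lw))     # clock period 3: the ADD in ID needs R8
print(load_interlock(add_independent, lw))   # no R8 operand, no stall
```

Under static issuing, when this check stalls the ADD in ID, the instructions behind it in IF are held as well.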
Data Hazards
Why is it that we cannot eliminate the stall cycle between the LW and the ADD ?
If we show the pipeline in our notation
Note the short lines in clock period 5 that indicate that write-in-first-half-read-in-the-second-half helps eliminate the stall between the LW and the SUB
Instruction               IF    ID    EX    MEM   WB
400200 LW R8, 0(R9)       1     2     3     4     5
400204 ADD R10, R11, R8   2     3/4   5     6     7
400208 SUB R13, R14, R8   3/4   5     6     7     8
40020C XOR R16, R17, R8   5     6     7     8     9
400210 SW R19, 0(R8)      6     7     8     9
400214 OR R21, R22, R8    7     8     9     10    11
400218 SLT R24, R25, R8   8     9     10    11    12
40021C BEQ R27, R8, 5     9     10    11
Haldun Hadimioglu CSE – Spring
2014
EMY CPU Version 1 176CS 2214
Data Hazards
Why is it that we cannot eliminate the stall cycle between the LW and the ADD?
The stall can be avoided (the interlock for the Load situation can be eliminated) if an independent instruction, one that does not need R8, is placed between the LW and the ADD.
For the first time we have an example of the importance of ordering instructions carefully.
If we had a compiler that was guaranteed to find an instruction that does not depend on the LW, we would never have the Load interlock!
This is what we call the compiler scheduling an independent instruction.
The instruction position following the LW is called the load delay slot, and the compiler fills the delay slot with an independent instruction.
This is called delayed Load.
If the compiler cannot find an independent instruction, it inserts a NOP in the load delay slot.
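The NOP fallback can be sketched as a small compiler pass. This is a minimal sketch, not a real scheduler: it never searches for an independent instruction to move, it only inserts a NOP when the instruction after a LW reads the loaded register. The tuple layout `(op, dest, srcs)` and the function name are assumptions made for the example.

```python
NOP = ("NOP", None, ())

def fill_load_delay_slots(program):
    """Insert a NOP after each LW whose next instruction reads its result.

    Each instruction is a hypothetical (op, dest, srcs) tuple.
    """
    scheduled = []
    for index, instr in enumerate(program):
        scheduled.append(instr)
        op, dest, _ = instr
        if op != "LW" or index + 1 >= len(program):
            continue
        next_srcs = program[index + 1][2]
        if dest in next_srcs:
            # No independent instruction was looked for: fall back to a NOP.
            scheduled.append(NOP)
    return scheduled

code = [
    ("LW", "R8", ("R9",)),
    ("ADD", "R10", ("R11", "R8")),
    ("SUB", "R13", ("R14", "R8")),
]
for op, dest, srcs in fill_load_delay_slots(code):
    print(op, dest, srcs)
```

A real compiler would first try to move an independent instruction into the slot, which keeps the program the same length and wastes no clock period.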
Data Hazards
Why is it that we cannot eliminate the stall cycle between the LW and the ADD?
If the compiler changes the order of instructions to avoid stalls or to fill delay slots, it is called pipeline scheduling or instruction scheduling.
We will have more examples of how the compiler arranges the code for better pipeline efficiency throughout the semester.
Data Hazards
Delayed Loads are not practical and are not used!
If delayed Loads were used, the Load interlock would be removed from the hardware, since it would be guaranteed that a Load is never followed by a dependent instruction.
Removing the interlock is guaranteed to work only if the CPU runs new code compiled for the delayed Load CPU.
But there is a lot of software compiled years ago by compilers that did not take this delayed Load feature into account.
The old code has a lot of LW instructions followed by dependent instructions.
If we ran it on a CPU with delayed Loads (no Load interlock), the dependent instruction would get wrong data and programs would generate wrong results.
This is the legacy software situation!
Our EMY CPU will not have delayed Loads!
Control Hazards
Control hazards occur when a control instruction is executed.
Control instructions are: Jump, Jump to a function, Return from function, and Branch.
Except for the branch instruction, all control instructions change the order of execution.
The branch may or may not change the order of execution, depending on the condition.
If the order of execution is changed, the pipeline is emptied.
That is, there is a pipeline start-up.
This results in a performance loss worse than the data hazard performance loss.
Even worse, we may branch to an instruction that is not in the memory.
This is a page fault that results in millions of clock periods of delay!
Control Hazards
Branches are especially troublesome.
The order of execution may or may not be changed, so we do not know which instruction to fetch next.
Which one to fetch depends on the test: equal to zero or not equal to zero?
Note that besides comparing with zero, we also have to compute the possible branch address, that is, the target address, the address of the target instruction.
If these two are not performed early, there is a large control hazard penalty of three clock periods.
If the branch instruction does not change the order of execution, i.e. we continue with the instruction following the branch, we say the branch is not taken.
If the branch instruction changes the order of execution, i.e. we continue with the instruction pointed to by the effective address, we say the branch is taken.
Control Hazards
If we recall what we did earlier:
Branch instructions go through stages IF, ID and EX.
They actually complete the execution back in stage IF.
Therefore, CPI_Branch = 4.
Control Hazards
Let's take a look at the code studied earlier, assuming that we take the branch!
[Figure: pipeline diagram over clock periods 1 to 9. The BEQ goes through IF, ID and EX, three stall cycles follow, and the next instruction then goes through IF, ID, EX, MEM and WB. Annotations: "The Branch causes a pipeline start-up!" and "A pipeline bubble is generated".]
400600 BEQ R8, R9, 4
400604 ADD R10, R11, R12
400608 SUB R13, R14, R15
40060C XOR R16, R17, R18
400610 SLT R19, R20, R21
400614 AND R22, R23, R24
Control Hazards
If we show the pipeline in our notation, assuming that we take the branch!
We see that we have three stall cycles if the branch is taken.

                      IF   ID   EX   MEM  WB
BEQ                   1    2    3
Target instruction    5    6    7    8    9

400600 BEQ R8, R9, 4
400604 ADD R10, R11, R12
400608 SUB R13, R14, R15
40060C XOR R16, R17, R18
400610 SLT R19, R20, R21
400614 AND R22, R23, R24
Control Hazards
Let's take a look at the code studied earlier, assuming that we do not take the branch!
[Figure: pipeline diagram over clock periods 1 to 9. The BEQ goes through IF, ID and EX, three stall cycles follow, and the instructions after the BEQ then enter the pipeline one after another. Annotations: "The Branch causes a pipeline start-up!" and "A pipeline bubble is generated".]
400600 BEQ R8, R9, 4
400604 ADD R10, R11, R12
400608 SUB R13, R14, R15
40060C XOR R16, R17, R18
400610 SLT R19, R20, R21
400614 AND R22, R23, R24
Control Hazards
If we show the pipeline in our notation, assuming that we do not take the branch!
Are we fetching the ADD in the 5th clock period? If yes, why?

Address  Instruction          IF   ID   EX   MEM  WB
400600   BEQ R8, R9, 4        1    2    3
400604   ADD R10, R11, R12    5    6    7    8    9
400608   SUB R13, R14, R15    6    7    8    9    10
40060C   XOR R16, R17, R18    7    8    9    10   11
400610   SLT R19, R20, R21    8    9    10   11   12
400614   AND R22, R23, R24    9    10   11   12   13
Control Hazards
Assuming that we do not take the branch!
Why are we fetching the ADD in the 5th clock period?
Can we fetch the ADD in the 2nd clock period?
The answer is yes, if the control unit allows the fetch cycle of the ADD to complete in the 2nd clock period.
Then, the ADD stays in the 2.IR register until the end of the 4th clock period and then moves to the ID stage, as will be shown soon.
Control Hazards
Assuming that we do not take the branch!
But if the control unit stops the fetching of the ADD in the 2nd clock period, to save itself from a memory access that might be unnecessary if the branch is taken, then the ADD must be fetched in the 5th clock period.
Why would the control unit stop fetching the ADD in the 2nd clock period?
We are asking this question because we know that decoding an instruction is very quick: just checking the Opcode bits is enough for many instructions.
Thus, the control unit would know right at the beginning of the 2nd clock period that there is a Branch in the ID stage, and we could still get the ADD by the end of the 2nd clock period!
Control Hazards
Assuming that we do not take the branch!
If the CPU designer decides to continue with the fetching of the ADD in the 2nd clock period:
[Figure: pipeline diagram over clock periods 1 to 9. The BEQ goes through IF, ID and EX; the ADD shows IF, Stall, Stall, ID, EX, MEM, WB; the remaining instructions follow in order.]
400600 BEQ R8, R9, 4
400604 ADD R10, R11, R12
400608 SUB R13, R14, R15
40060C XOR R16, R17, R18
400610 SLT R19, R20, R21
400614 AND R22, R23, R24
Control Hazards
Assuming that we do not take the branch!
If the CPU designer decides to continue with the fetching of the ADD in the 2nd clock period, and we show the pipeline in our notation:
We save one clock period!

Address  Instruction          IF    ID   EX   MEM  WB
400600   BEQ R8, R9, 4        1     2    3
400604   ADD R10, R11, R12    2/4   5    6    7    8
400608   SUB R13, R14, R15    5     6    7    8    9
40060C   XOR R16, R17, R18    6     7    8    9    10
400610   SLT R19, R20, R21    7     8    9    10   11
400614   AND R22, R23, R24    8     9    10   11   12
Control Hazards
Assuming that we do not take the branch!
The CPU designer might decide to design the control unit so that it aborts the fetch of the ADD in the 2nd clock period.
This is a toss-up for the CPU designer!
How often the branches are not taken is critical.
If branches are often not taken, then the designer can design the control unit to allow fetching the ADD.
BUT, if we go ahead with the fetch and it causes a page fault (the instruction is not in the memory), and we read the page of the instruction from disk and then realize the branch is taken, all this effort will be wasted!
The frequency of untaken branches depends on the application, the programmer, the compiler and the instruction set!
We decide not to fetch the next instruction. We do not fetch the ADD!
Control Hazards
If we summarize: if we have a control instruction, the time penalty is high.
Jump, jump to a function and return from a function instructions require an unconditional change to the order of execution.
The sooner we calculate the target instruction address, the more stall cycles we can save.
But with branches we also need to test the condition, so we need to determine two items:
The target address
The condition
The sooner we calculate the target instruction address and the condition, the more stall cycles we can save.
Control Hazards
Thus, solving the branch execution problem is more difficult than the others.
In fact, one can think of the jump, jump to a function and return from a function instructions as a special case of the branch where the condition is always true, so we have to take the jump/return.
Overall, control hazards, especially branch instructions, attract a lot of interest in computer architecture research.
Many journal and conference papers have been published over the last 20-plus years on the topic of branch penalty reduction!
Control Hazards
Let's change our earlier code a little.
If the Branch is not taken, the next instruction is the SUB, the instruction that follows the Branch.
If the Branch is taken, the target instruction is the OR, which is three instructions below the instruction that follows the Branch (the SUB), since the offset is 3.
400600 LW R8, 0(R9)
400604 ADD R10, R11, R12
400608 BEQ R13, R14, 3
40060C SUB R15, R16, R18
400610 XOR R19, R20, R21
400614 SLT R22, R23, R24
400618 OR R25, R26, R27
40061C SW R28, 0(R29)
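The target of the BEQ in this listing can be checked with a short calculation. The sketch below assumes, as the listing suggests, that the offset counts instructions from the instruction that follows the Branch and that each instruction is 4 bytes long; the function name is made up for the example.

```python
def branch_target(branch_address, offset):
    """Target address of a taken branch.

    Assumption: the offset is counted in instructions (4 bytes each)
    from the instruction that follows the branch.
    """
    next_address = branch_address + 4      # address of the SUB in the listing
    return next_address + 4 * offset

print(hex(branch_target(0x400608, 3)))     # 0x400618, the address of the OR
```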
Control Hazards
Assuming that we do take the branch and do not fetch the SUB!
[Figure: pipeline diagram over clock periods 1 to 11. The LW and the ADD go through IF, ID, EX, MEM and WB; the BEQ goes through IF, ID and EX and is followed by three stall cycles; the instructions at the branch target then enter the pipeline. Annotation: "A pipeline start-up is created".]
400600 LW R8, 0(R9)
400604 ADD R10, R11, R12
400608 BEQ R13, R14, 3
40060C SUB R15, R16, R18
400610 XOR R19, R20, R21
400614 SLT R22, R23, R24
400618 OR R25, R26, R27
40061C SW R28, 0(R29)
Control Hazards
For the case where we take the branch, we have a pipeline start-up created in clock period 7.
That is, the pipeline is emptied!
We need to reduce the penalty cycles for our pipeline.
We will modify our state diagram so that Branch instructions take two clock periods.
Branch instructions will be in only IF and ID: CPI_Branch = 2.
There will be only one clock period of stall after this implementation.
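To get a feel for what the change from CPI_Branch = 4 to CPI_Branch = 2 is worth, a rough average-CPI estimate can be computed. The sketch below is only a back-of-the-envelope model: the 20% branch fraction is an assumed value, not a figure from these slides, every non-branch instruction is charged one clock period, and data hazard stalls are ignored.

```python
def average_cpi(branch_fraction, cpi_branch, cpi_other=1.0):
    """Weighted average CPI for a mix of branch and non-branch instructions."""
    return (1.0 - branch_fraction) * cpi_other + branch_fraction * cpi_branch

# Assumed instruction mix: 20% branches (hypothetical value).
before = average_cpi(0.2, 4)   # branches complete after IF, ID, EX and back in IF
after = average_cpi(0.2, 2)    # branches complete in IF and ID only
print(f"{before:.2f} -> {after:.2f}")   # 1.60 -> 1.20
```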
Control Hazards
The changes on the state diagram for the Branch instruction
As we discussed before, we need to determine the target address and the condition as early as possible.
We know we have a branch at the beginning of the ID cycle.
In that case, we determine the target address and the condition by using the information in the ID stage.
The target address calculation requires adding PC and (4 * Offset), for which the ID stage now has an ADDer circuit.
We can justify a separate ADDer in the ID stage, besides the ones in IF and EX, since there is a large Branch penalty to pay.
The execution of all other, non-control instructions is not affected.
Control Hazards
The changes on the state diagram for the Branch instruction

State 0 (IF):
2.IR ← If (2.IR.opcode == BEQ) then NOP else M[PC]
PC ← If ((2.IR.opcode == BEQ) & (GPR[2.IR.Rs] == GPR[2.IR.Rt])) then (2.NPC + (4 * 2.IR.DOImm+)) else if (2.IR.opcode ≠ BEQ) then PC + 4
2.NPC ← If ((2.IR.opcode == BEQ) & (GPR[2.IR.Rs] == GPR[2.IR.Rt])) then (2.NPC + (4 * 2.IR.DOImm+)) else if (2.IR.opcode ≠ BEQ) then PC + 4

State 1 (ID):
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
3.IR ← 2.IR

CPI_Branch = 2
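The IF state above can be paraphrased in Python to make the three cases explicit. This is a sketch under assumptions: instructions are modeled as dictionaries with hypothetical keys (`opcode`, `rs`, `rt`, `imm`), and the case of an untaken BEQ, for which the state diagram gives no assignment, is read here as "PC and 2.NPC keep their values".

```python
BEQ = "BEQ"
NOP = {"opcode": "NOP"}

def if_state(pc, npc, ir, gpr, memory):
    """One clock period of the IF state; returns (new 2.IR, new PC, new 2.NPC).

    ir is the instruction currently held in 2.IR, i.e. the one in ID.
    """
    is_beq = ir["opcode"] == BEQ
    taken = is_beq and gpr[ir["rs"]] == gpr[ir["rt"]]
    new_ir = NOP if is_beq else memory[pc]          # no fetch behind a BEQ
    if taken:
        new_pc = new_npc = npc + 4 * ir["imm"]      # 2.NPC + 4 * offset
    elif not is_beq:
        new_pc = new_npc = pc + 4
    else:
        new_pc, new_npc = pc, npc                   # assumed: values are held
    return new_ir, new_pc, new_npc

gpr = {"R13": 5, "R14": 5}
beq = {"opcode": BEQ, "rs": "R13", "rt": "R14", "imm": 3}
memory = {0x40060C: {"opcode": "SUB"}}
new_ir, new_pc, new_npc = if_state(0x40060C, 0x40060C, beq, gpr, memory)
print(new_ir["opcode"], hex(new_pc))   # NOP 0x400618
```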
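Control Hazards
The changes to the IF and ID stages
[Figure: datapath of the IF and ID stages. Labels: PC, MUX1, two ADD units, the constant 4, 2.NPC, 2.IR, GPR with 5-bit Rs and Rt inputs and GPR[Rs] and GPR[Rt] outputs, an "Equal ?" comparator, the 16-bit DOImm field, Sign Extend to 32 bits, and *4.]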
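Control Hazards
Final design of the ID stage with forwarding
[Figure: datapath of the ID stage. Labels: 2.NPC, 2.IR, an ADD unit, the 16-bit DOImm field, Sign Extend to 32 bits, *4, GPR with 5-bit Rs and Rt inputs and GPR[Rs] and GPR[Rt] outputs, an "Equal ?" comparator, and MUX7 and MUX8, each with the inputs ALU in EX, 4.ALUout, 5.ALUout and 5.MDR.]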
Control Hazards
The changes to the IF and ID stages
The ADDer in the IF stage is used by MUX1 in the IF stage.
The Equal circuit has a forwarding circuit with MUX7 and MUX8.
We have forwardings to the ID stage so that we bypass one or both GPR registers to test.
These forwardings are from:
The output of the ALU in EX
4.ALUoutput
5.ALUoutput
5.MDR
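The role of MUX7 and MUX8 can be sketched as a selection function. This is an illustrative sketch only: the function name and the list of `(destination register, value)` pairs are assumptions, and so is the priority order (the youngest producer, the ALU in EX, is checked first, then 4.ALUoutput, then 5.ALUoutput or 5.MDR).

```python
def read_with_forwarding(reg, gpr, forward_sources):
    """Value the Equal circuit should see for register reg.

    forward_sources is a list of (destination register, value) pairs,
    assumed to be ordered from the youngest producer to the oldest.
    Falls back to the GPR value when no producer matches.
    """
    for dest, value in forward_sources:
        if dest == reg:
            return value
    return gpr[reg]

# Hypothetical situation: an ADD writing R10 is in EX while a
# BEQ R10, R14 is in ID; the GPR still holds the old value of R10.
gpr = {"R10": 0, "R14": 7}
forward_sources = [("R10", 7)]          # output of the ALU in EX
rs_value = read_with_forwarding("R10", gpr, forward_sources)
rt_value = read_with_forwarding("R14", gpr, forward_sources)
print(rs_value == rt_value)             # True
```

Without the bypass the comparison would use the stale R10 and the branch decision would be wrong.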
Control Hazards
The execution of the BEQ now
Assume that the Branch is taken.
[Figure: pipeline diagram over clock periods 1 to 11. The LW and the ADD go through IF, ID, EX, MEM and WB; the BEQ goes through IF and ID only; the instructions at the branch target then enter the pipeline. Annotation: "A 1-clock period long bubble is created. The other stall cycle is because the BEQ takes 2 clock periods".]
400600 LW R8, 0(R9)
400604 ADD R10, R11, R12
400608 BEQ R13, R14, 3
40060C SUB R15, R16, R18
400610 XOR R19, R20, R21
400614 SLT R22, R23, R24
400618 OR R25, R26, R27
40061C SW R28, 0(R29)
Control Hazards
The execution of the Branch now
Assume that the Branch is taken. If we show the pipeline in our notation:
It looks like there is a 2-clock period long bubble created on the previous slide.
This is because the BEQ does not have its EX cycle anymore!
Overall, there is only one stall cycle now!

Address  Instruction          IF   ID   EX   MEM  WB
400600   LW  R8, 0(R9)        1    2    3    4    5
400604   ADD R10, R11, R12    2    3    4    5    6
400608   BEQ R13, R14, 3      3    4
40060C   SUB R15, R16, R18
400610   XOR R19, R20, R21
400614   SLT R22, R23, R24
400618   OR  R25, R26, R27    5    6    7    8    9
40061C   SW  R28, 0(R29)      6    7    8    9    10
Control Hazards
Can we improve the BEQ hardware so that there is not even one stall cycle? YES!
Solution: We will use delayed branches.
Control Hazards
Delayed branches
A delayed branch makes use of both the compiler and the hardware.
In this technique, we continue the execution of the instruction(s) that follow(s) the Branch, in the branch delay slot, no matter what the Branch outcome is.
The branch delay slot is the set of instruction positions following the branch.
The length of the branch delay slot is the time penalty paid ≡ the number of stall cycles due to the Branch ≡ the amount of time we are not sure about the target instruction.
For the current design it is 1 clock period. Therefore, the branch delay slot has 1 instruction.
[Figure: "Branch Rx, Offset" followed by the branch delay slot, which is one instruction long or more.]
Control Hazards
The changes on the state diagram due to delayed branches
We have to execute the instruction that follows the branch in any case.

State 0 (IF):
2.IR ← M[PC]
PC ← If ((2.IR.opcode == BEQ) & (GPR[2.IR.Rs] == GPR[2.IR.Rt])) then (PC + (4 * 2.IR.DOImm+)) else PC + 4

State 1 (ID):
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
3.IR ← 2.IR

CPI_Branch = 2
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no matter what the Branch outcome is.
These instructions must be independent of the branch so that the program execution is correct!
For our EMY CPU the branch delay slot is one instruction long, because for one clock period we are not sure which instruction is the target instruction.
In the following clock period we know which instruction is the target.
Thus, we execute the instruction right after the Branch whether we take the branch or not.
It should be easy to find one instruction that can be executed no matter what the Branch outcome is?
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no matter what the Branch outcome is.
It is the compiler that changes the order of instructions so that after the Branch there is an independent instruction.
We say the compiler schedules an instruction to the Branch delay slot.
This is another example of how ordering instructions is important (needed).
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no matter what the Branch outcome is.
How can the compiler find an independent instruction for the EMY CPU to place in the Branch delay slot?
There are three possible cases that a compiler looks for.
Case 1: From before the branch
If the instruction before the Branch is independent of the Branch.
This one always improves the performance.

Original code:
ADD R8, R9, R10
BEQ R11, R12, 5

New code:
BEQ R11, R12, 6
ADD R8, R9, R10      (Branch delay slot)

The compiler realizes the ADD is independent of the BEQ ≡ the ADD can be executed after the BEQ. The compiler moves the ADD after the BEQ.
Note the change of offset for the BEQ.
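Case 1 can be sketched as a tiny scheduling function. The sketch assumes that the preceding instruction is an ordinary ALU instruction and that the only dependence to check is whether the Branch reads its result; the tuple layouts and the function name are hypothetical.

```python
def schedule_from_before(prev, branch):
    """Move prev into the delay slot of branch if the branch does not read it.

    prev is (op, dest, srcs); branch is (op, rs, rt, offset).
    Returns the new two-instruction sequence, or None if prev cannot move.
    """
    op, rs, rt, offset = branch
    if prev[1] in (rs, rt):
        return None                      # the Branch needs prev's result
    # The Branch now sits one instruction earlier while the target stays
    # where it was, so the offset grows by one instruction.
    return [(op, rs, rt, offset + 1), prev]

add = ("ADD", "R8", ("R9", "R10"))
beq = ("BEQ", "R11", "R12", 5)
print(schedule_from_before(add, beq))
```

For the code on this slide the result is the BEQ with offset 6 followed by the ADD, matching the new code shown above.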
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no matter what the Branch outcome is.
Case 2: From the branch target
It is used for loops, where there is a large probability that the branch will be taken (many times).
It improves the performance if the branch is taken.

Original code:
loop: SUB R8, R9, R10
      ----
      ADD R11, R12, R13
      BEQ R11, R14, (-9)₁₀

New code:
      SUB R8, R9, R10
loop: ----
      ADD R11, R12, R13
      BEQ R11, R14, (-8)₁₀
      SUB R8, R9, R10      (Branch delay slot)

The compiler realizes the ADD is not independent of the BEQ. But the SUB is independent of the BEQ ≡ the SUB can be executed after the BEQ. The compiler places a copy of the SUB in the Branch delay slot. This will save time if we branch back to the beginning of the loop. If we exit the loop, it must be OK to execute the SUB! The Branch offset must be adjusted! The code is longer!
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no matter what the Branch outcome is.
Case 3: From fall through
It is used when there is a high probability that the branch will not be taken.
It improves the performance if the branch is not taken.

Original code:
ADD R8, R9, R10
BEQ R8, R11, 7
SUB R12, R13, R14

New code:
ADD R8, R9, R10
BEQ R8, R11, 7
SUB R12, R13, R14      (Branch delay slot)

The compiler realizes the ADD is not independent of the BEQ. But the SUB is independent of the Branch ≡ the SUB can be executed right after the BEQ. The compiler moves the SUB to the Branch delay slot. This will save time if the branch is not taken. It must be OK to execute the SUB even if we take the branch!
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no matter what the Branch outcome is.
You might have realized that the delayed branch is not practical, since it requires the compiler to know that the CPU is expecting an independent instruction in the Branch delay slot.
This means that old code cannot be run on this EMY CPU, because either:
1. The compiler did not generate the code for a CPU with a Branch delay slot, or
2. The compiler did generate code with a Branch delay slot, but the delay slot was more than one instruction long since it was for an old generation EMY CPU.
This is the legacy software situation!
Today's microprocessors do not use delayed branches because of the compatibility issue.
However, academically, it is an interesting idea.
Control Hazards
Delayed Branches
We execute the instructions in the branch delay slot no matter what the Branch outcome is.
Shall we not use delayed branches for the EMY CPU?
We will use delayed Branches in Version 1 for the sake of simplifying our discussion.
We will not use delayed branches in Computer Architecture II.
Control Hazards
Delayed Branches
Let's take a look at the execution of the following code with a taken branch.
Notice the SUB is an independent instruction in the branch delay slot.
It must be OK to execute the SUB even if we take the branch.
Notice we changed the BEQ register to R10 to show forwarding to the ID stage.
The forwarding is from the EX stage to the ID stage, where the output of the ALU is forwarded to the ID stage to bypass GPR[Rs] of the BEQ, which is R10.

Address  Instruction          IF   ID   EX   MEM  WB
400600   LW  R8, 0(R9)        1    2    3    4    5
400604   ADD R10, R11, R12    2    3    4    5    6
400608   BEQ R10, R14, 3      3    4
40060C   SUB R15, R16, R18    4    5    6    7    8
400610   XOR R19, R20, R21
400614   SLT R22, R23, R24
400618   OR  R25, R26, R27    5    6    7    8    9
40061C   SW  R28, 0(R29)      6    7    8    9
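The order in which instructions are fetched with a delayed branch can be traced with a few lines of Python. The sketch assumes no stalls, one fetch per clock period, and a branch that is resolved while it is in ID, at which point the instruction in the delay slot is being fetched; the function and parameter names are made up for the example.

```python
def fetch_trace(branches, start, count):
    """Addresses fetched in consecutive clock periods with delayed branches.

    branches maps a branch address to (offset, taken). The branch is
    resolved while it is in ID, so the instruction after it (the delay
    slot) is always fetched before the target.
    """
    trace = []
    pc = start
    in_id = None                              # address of the instruction in ID
    for _ in range(count):
        trace.append(pc)                      # 2.IR <- M[PC]
        if in_id in branches and branches[in_id][1]:
            next_pc = pc + 4 * branches[in_id][0]
        else:
            next_pc = pc + 4
        in_id = pc
        pc = next_pc
    return trace

trace = fetch_trace({0x400608: (3, True)}, 0x400600, 6)
print([hex(address) for address in trace])
```

The trace visits 400600, 400604, 400608, 40060C, 400618 and 40061C: the LW, the ADD, the BEQ, the SUB in the delay slot, and then the OR and the SW at the target, in clock periods 1 to 6.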
Summary of Version 1
We added hardware to deal with structural, data and control hazards.
Still, it executes only integer instructions.
It issues instructions statically, except for the branch, which is not issued and completes in two clock periods.
The branch is not issued, to save time!
Because of static issuing, instructions complete in order, except for the branch, which can complete before instructions that were issued earlier.
This results in imprecise interrupts!
[Figure: the IF, ID, EX, MEM and WB stages, labeled "Static instruction issue".]
Summary of Version 1
We realize we need to modify Version 1 so that:
It executes FP instructions.
The memory is not ideal.
It has precise interrupts.
All three are difficult problems to solve.
FP operations, such as add, subtract, multiply and divide, are complex and cannot be completed in one clock period as the integer add operation can.
The integer add is done in EX and takes one clock period.
The FP add, subtract, multiply and divide will be done in EX and take multiple clock cycles!
More instructions can complete out of order.
The interrupt hardware becomes even more complex.
We solve one problem (executing FP instructions) but make the other problem more complex.
The complete memory hierarchy must be considered: the cache memories, the slower main memory and the virtual memory (disk).
Interrupts can happen randomly.
We also need to save the state, which is not easy for a pipelined CPU.
Advanced versions in Computer Architecture II will solve them.
Test Program
Determine when the execution of the second iteration ends if L1 cache memories take one clock period and there is no cache miss.
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases.

                      First iteration       Second iteration
Instruction           IF  ID  EX  MEM  WB   IF  ID  EX  MEM  WB
LW  R8, 0(R9)
ADD R10, R11, R8
SUB R11, R10, R8
XOR R9, R11, R10
SLT R12, R10, R11
OR  R13, R12, R14
BNE R13, (-7)₁₀
SW  R12, 0(R13)

The answer is on the next slide.
Test Program
Determine when the execution of the second iteration ends if L1 cache memories take one clock period and there is no cache miss.
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases.

                      First iteration             Second iteration
Instruction           IF    ID    EX  MEM  WB     IF     ID     EX  MEM  WB
LW  R8, 0(R9)         1     2     3   4    5      10     11     12  13   14
ADD R10, R11, R8      2     3/4   5   6    7      11     12/13  14  15   16
SUB R11, R10, R8      3/4   5     6   7    8      12/13  14     15  16   17
XOR R9, R11, R10      5     6     7   8    9      14     15     16  17   18
SLT R12, R10, R11     6     7     8   9    10     15     16     17  18   19
OR  R13, R12, R14     7     8     9   10   11     16     17     18  19   20
BNE R13, (-7)₁₀       8     9                     17     18
SW  R12, 0(R13)       9     10    11  12          18     19     20  21

All hazards are RAW.
The second iteration ends in clock period 21.
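The timing for this slide can be reproduced with a small model of the pipeline. The sketch below is a simplification and not the EMY control unit: it assumes one-clock caches with no misses, forwarding that removes every stall except the Load interlock, a branch that completes in ID with a one-instruction delay slot, and a Store that has no WB cycle. The tuple layout and all names are assumptions made for the example.

```python
def pipeline_timing(program):
    """Clock periods per stage for an in-order, statically issued pipeline.

    Each instruction is a hypothetical (op, dest, srcs) tuple. Only the
    Load interlock is modeled as a stall (the dependent instruction
    waits in ID while the LW is in MEM).
    """
    rows = []
    for op, dest, srcs in program:
        prev = rows[-1] if rows else None
        if prev is None:
            if_start, id_start = 1, 2
        else:
            if_start = prev["id_start"]        # IF frees up when prev enters ID
            id_start = prev["id_end"] + 1      # ID frees up when prev leaves ID
        id_end = id_start
        if prev is not None and prev["op"] == "LW" and prev["dest"] in srcs:
            id_end = max(id_end, prev["mem"])  # Load interlock
        row = {"op": op, "dest": dest, "if_start": if_start,
               "id_start": id_start, "id_end": id_end,
               "ex": None, "mem": None, "wb": None}
        if op != "BNE":                        # the branch completes in ID
            row["ex"] = id_end + 1
            row["mem"] = id_end + 2
            if op != "SW":                     # a Store has no WB cycle
                row["wb"] = id_end + 3
        rows.append(row)
    return rows

iteration = [
    ("LW", "R8", ("R9",)),
    ("ADD", "R10", ("R11", "R8")),
    ("SUB", "R11", ("R10", "R8")),
    ("XOR", "R9", ("R11", "R10")),
    ("SLT", "R12", ("R10", "R11")),
    ("OR", "R13", ("R12", "R14")),
    ("BNE", None, ("R13",)),
    ("SW", None, ("R12", "R13")),   # executes in the branch delay slot
]
rows = pipeline_timing(iteration * 2)   # the branch is taken once
print(rows[-1]["mem"])                  # 21
```

Because the SW sits in the delay slot of the taken branch, the second iteration's LW is fetched right behind it, so the two iterations form one uninterrupted instruction stream in this model.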
Test Program
Determine when the execution of the second iteration ends if L1 cache memories take two clock periods and there is no cache miss.
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases.

                      First iteration                Second iteration
Instruction           IF     ID    EX  MEM    WB     IF     ID     EX  MEM    WB
LW  R8, 0(R9)         1-2    3     4   5-6    7      17-18  19     20  21-22  23
ADD R10, R11, R8      3-4    5/6   7   8      9      19-20  21/22  23  24     25
SUB R11, R10, R8      5-6    7     8   9      10     21-22  23     24  25     26
XOR R9, R11, R10      7-8    9     10  11     12     23-24  25     26  27     28
SLT R12, R10, R11     9-10   11    12  13     14     25-26  27     28  29     30
OR  R13, R12, R14     11-12  13    14  15     16     27-28  29     30  31     32
BNEZ R13, (-7)₁₀      13-14  15                      29-30  31
SW  R12, 0(R13)       15-16  17    18  19-20         31-32  33     34  35-36

All hazards pointed to by the arrows are data hazards of type RAW.
The second iteration ends in clock period 36.
There are structural hazards in the IF and MEM stages due to slow cache memories.
Test Program
Determine when the execution of the second iteration ends if L1 cache memories have misses.
Assume that the memory levels are as described in the unpipelined CPU case, with the following additions and reminders:
The physical memory has 4 Bytes per location.
The bus width between the physical memory and the lowest level cache is 4 Bytes.
The instruction cache is 8 KBytes.
The data cache is 16 KBytes.
Both cache block sizes are 32 Bytes.
Both cache memories use direct mapping.
Both caches use write-back with write-allocate.
Both cache memories access the needed item first.
The Data Cache has two read and two write ports.
The Instruction Cache has two read ports.
The latency to access the L2 cache is 4 clock periods, and transferring each 4-Byte content takes one clock period.
The L2 cache memory can handle one miss per L1 cache memory at a time.
This means that if the instruction cache and the data cache have misses at the same time, they will be handled at the same time by the L2 cache.
This means the L2 cache can handle two hits at the same time.
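A few numbers follow directly from these assumptions and can be checked in Python. The sketch is only arithmetic on the listed parameters: the function names are made up, the sizes are assumed to be powers of two, and the full-block figure additionally assumes that the 4-Byte transfers follow each other back to back.

```python
def direct_mapped_geometry(cache_bytes, block_bytes):
    """Return (number of blocks, offset bits, index bits) for a direct-mapped cache."""
    blocks = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1   # log2 for a power of two
    index_bits = blocks.bit_length() - 1
    return blocks, offset_bits, index_bits

def needed_item_latency(l2_latency=4, transfer=1):
    """Clock periods until the needed 4-Byte item arrives from the L2 cache."""
    return l2_latency + transfer

def full_block_latency(block_bytes=32, bus_bytes=4, l2_latency=4, transfer=1):
    """Clock periods to bring in a whole block, assuming back-to-back transfers."""
    return l2_latency + (block_bytes // bus_bytes) * transfer

print(direct_mapped_geometry(8 * 1024, 32))    # instruction cache: (256, 5, 8)
print(direct_mapped_geometry(16 * 1024, 32))   # data cache: (512, 5, 9)
print(needed_item_latency())                   # 5
print(full_block_latency())                    # 12
```

Since both caches access the needed item first, a waiting instruction can continue after the 5 clock periods rather than after the full block has arrived.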
Test Program
Determine when the execution of the second iteration ends if L1 cache memories have misses.
Assume that the L1 instruction and data cache memories and the physical memory have the following properties:
Each Level 1 cache memory can handle only one miss at a time.
A Store miss requires that the Store instruction stays in the MEM stage until the miss is handled.
It cannot just store to the write buffer and then proceed.
Each Level 1 cache memory can handle up to four hits while it handles a miss.
An instruction that immediately follows a Load or a Store is forced to stall an extra clock period in the ID stage to make sure the access for the data element is completed.
For the given code, assume the following:
Each data element accessed is in a separate data block, and none of these blocks map to the same area in the data cache.
It means each Load and Store instruction accesses a different block in each iteration.
This means there will be four data cache misses in two iterations!
This is very unusual, but it is assumed here just to show an extreme case.
Test Program
We observe all 8 instructions are in one instruction cache block.
There are four data accesses, each one in a separate data block, resulting in four data cache misses.
Determine when the execution of the second iteration ends.
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases.

                      First iteration                Second iteration
Instruction           IF    ID    EX  MEM    WB      IF     ID     EX  MEM    WB
LW  R8, 0(R9)         1/5   6     7   8/12   13      18     19/24  25  26/30  31
ADD R10, R11, R8      6     7/12  13  14     15      19/24  25/30  31  32     33
SUB R11, R10, R8      7/12  13    14  15     16      25/30  31     32  33     34
XOR R9, R11, R10      13    14    15  16     17      31     32     33  35     36
SLT R12, R10, R11     14    15    16  17     18      32     33     34  35     36
OR  R13, R12, R14     15    16    17  18     19      33     34     35  36     37
BNE R13, (-7)₁₀       16    17                       34     35
SW  R12, 0(R13)       17    18    19  20/24          35     36     37  38/42

All hazards pointed to by the arrows are data hazards of type RAW.
The second iteration ends in clock period 42.
There are structural hazards in the IF and MEM stages due to cache misses.