CSIS1120A 14. Processor Organization
Processor Organization
The processor carries out instructions through two kinds of operation:
1. Moving data from one place to another.
2. Transforming data through the ALU.
The data are stored in registers inside the CPU.
These operations are controlled by "control signals" generated by the control unit (compare the flags set in the instruction decoder function of the simulator in Assignment 2/4).
Data movement can be described as follows, e.g. instruction fetch:
MAR ← PC
PC ← PC+4
Processors
Typical internal structure of CPU
Registers
Registers are divided into two types:
User-visible registers: general-purpose registers and other registers such as index registers, base pointers, etc.
Hidden registers: PC, IR, MAR, MBR, TEMP, PSW (processor status word, or flag register), etc.
Control Signals
Data movement is controlled by control signals:
e.g. MAR ← PC
This operation requires two control signals:
1. A control signal telling the PC to put its content onto the CPU bus (in terms of digital logic circuits: enable the output driver of the PC).
2. A control signal telling the MAR to take the value from the CPU bus (in terms of digital logic circuits: clock the latch so that the register stores its input).
Other operations may require more control signals, e.g. control signals to tell the ALU which operation to perform, or control signals to read/write external memory.
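The two-signal transfer can be sketched in Python. This is only an illustration: the `Register` and `Bus` classes and the `out_enable` and `latch` parameter names are assumptions made for this sketch, not part of the course simulator.

```python
class Register:
    """A register holding one value (hypothetical helper for this sketch)."""
    def __init__(self, name, value=0):
        self.name = name
        self.value = value


class Bus:
    """Single shared CPU bus; remembers the value last driven onto it."""
    def __init__(self):
        self.value = None


def transfer(bus, src, dst, out_enable, latch):
    """Perform dst <- src; the move happens only if both signals are asserted."""
    if out_enable:                        # signal 1: source drives the bus
        bus.value = src.value
    if latch and bus.value is not None:   # signal 2: destination latches the bus
        dst.value = bus.value


pc, mar, bus = Register("PC", 0x100), Register("MAR"), Bus()
transfer(bus, pc, mar, out_enable=True, latch=True)   # MAR <- PC
print(hex(mar.value))
```

If only one of the two signals is asserted, the value either never reaches the bus or is never stored, which is why the slide counts two control signals for one register transfer.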
Hardwired Control vs MicroProgrammed Control
We have two ways to generate the control signals.
Hardwired control
  The control signals are generated by logic gates, e.g. write down the truth table and implement it with logic gates.
  It is faster, but the design is more complicated (nowadays done with CAD tools).
Microprogrammed control
  The control signals are stored in memory as microcode.
  The entire truth table is stored in memory, and the inputs serve as the memory addresses.
  Even though the control store is inside the CPU, a memory access is still slower than logic gates.
  However, the design and implementation are simple, and easy to modify/debug.
Modern processors tend to use hardwired control; by simplifying the instruction set, we can make the logic simpler.
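The difference between the two approaches can be sketched as computing the same truth table in two ways. The 2-bit opcode encoding and the four signal names below are hypothetical, chosen only for this example.

```python
# Hypothetical 2-bit opcodes: 0 = ADD, 1 = LD, 2 = ST, 3 = NOP.
# Signals returned: (alu_op, mem_read, mem_write, reg_write).

def hardwired(op1, op0):
    """Control signals computed by Boolean expressions (the 'logic gates')."""
    alu_op = (not op1) and (not op0)     # ADD
    mem_read = (not op1) and op0         # LD
    mem_write = op1 and (not op0)        # ST
    reg_write = not op1                  # ADD or LD write a register
    return (bool(alu_op), bool(mem_read), bool(mem_write), bool(reg_write))


# The same truth table stored in a control store, addressed by the opcode.
CONTROL_STORE = [
    (True, False, False, True),     # 0: ADD
    (False, True, False, True),     # 1: LD
    (False, False, True, False),    # 2: ST
    (False, False, False, False),   # 3: NOP
]


def microprogrammed(opcode):
    """Control signals looked up in memory instead of being computed."""
    return CONTROL_STORE[opcode]


for opcode in range(4):
    print(opcode, hardwired(opcode >> 1, opcode & 1), microprogrammed(opcode))
```

Changing an instruction's behaviour means editing one table row in the microprogrammed version, but rewriting Boolean expressions in the hardwired one, which mirrors the trade-off described above.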
Instruction Execution Cycle
There are three stages in an instruction execution cycle:
1. Instruction Fetch
2. Instruction Decode
3. Instruction Execution:
   Read operands from registers
   ALU operations
   Memory reference
   Branch completion
   Write back to registers
These are similar to the contents of the simulator in the Assignment.
Example
ADD A, B, C
where A, B, C use absolute (direct) addressing, i.e. the addresses are in the words following the instruction. This is a 4-word instruction.
Instruction Fetch
IF: MAR ← PC
    PC ← PC+4
    IR ← mem[MAR]
Example: ADD A, B, C
Instruction Execute
OF: MAR ← PC            # operand fetch
    PC ← PC+4
    MBR ← mem[MAR]      # addr of A in MBR
    MAR ← MBR
    MBR ← mem[MAR]      # value of A in MBR
    ALU.input1 ← MBR

    MAR ← PC
    PC ← PC+4
    MBR ← mem[MAR]      # addr of B in MBR
    MAR ← MBR
    MBR ← mem[MAR]      # value of B in MBR
    ALU.input2 ← MBR
Exec: ALU.output ← ALU.input1 + ALU.input2

WB: MAR ← PC            # write back result
    PC ← PC+4
    MBR ← mem[MAR]      # addr of C in MBR
    MAR ← MBR
    MBR ← ALU.output    # value of C in MBR
    mem[MAR] ← MBR      # memory write
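The register transfers for ADD A, B, C can be traced with a small Python sketch. The dictionary-based memory, the string opcode and the concrete addresses are assumptions made for illustration; only the order of the transfers follows the slides.

```python
def run_add(mem, pc):
    """Run one 4-word 'ADD A, B, C' at byte address pc; return the new PC.

    mem maps byte addresses to words (a hypothetical, simplified memory).
    """
    # IF: fetch the opcode word
    mar = pc
    pc += 4
    ir = mem[mar]
    if ir != "ADD":
        raise ValueError(f"unexpected opcode {ir!r}")

    # OF: fetch the two source operands through MAR/MBR
    alu_inputs = []
    for _ in range(2):
        mar = pc
        pc += 4
        mbr = mem[mar]      # address of the operand
        mar = mbr
        mbr = mem[mar]      # value of the operand
        alu_inputs.append(mbr)

    # Exec
    alu_output = alu_inputs[0] + alu_inputs[1]

    # WB: fetch the address of C, then write the result there
    mar = pc
    pc += 4
    mbr = mem[mar]          # address of C
    mar = mbr
    mbr = alu_output
    mem[mar] = mbr          # memory write
    return pc


# Instruction words at 0..12; A = 7 at address 100, B = 5 at 104, C at 108.
memory = {0: "ADD", 4: 100, 8: 104, 12: 108, 100: 7, 104: 5, 108: 0}
next_pc = run_add(memory, 0)
print(memory[108], next_pc)
```

Counting the `mem[...]` accesses in the sketch gives seven memory reads and one write for a single ADD, which is the cost that register-based instructions later in the deck try to avoid.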
Instruction Pipeline
The goal is to increase the throughput of instruction execution.
We assume that only the LD/ST instructions use memory; all other instructions use registers.
Divide the execution process into different stages, for example:
1. Instruction Fetch (IF)
2. Instruction Decode (ID) [operand fetch from registers]
3. Instruction Execution (EX)
4. Memory Operation (MEM)
5. Write Back (to the register file) (WB)
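How the five stages overlap can be shown by printing a pipeline chart. This is a minimal sketch that assumes an ideal pipeline with no hazards and one new instruction entering per clock cycle; the function name `pipeline_chart` is made up for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(n_instructions):
    """Return one row per instruction for an ideal (hazard-free) pipeline.

    Row i lists, for each clock cycle, the stage instruction i occupies
    ('' when the instruction is not in the pipeline during that cycle).
    """
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage      # instruction i enters IF in cycle i
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_chart(3)):
    print(f"I{i + 1}: " + " ".join(f"{cell:<3}" for cell in row))
```

Reading the chart column by column shows which stages are busy in the same cycle, and therefore which hardware units must be able to work at the same time.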
Pipeline
Note that not all instructions require all stages, e.g. only LOAD and STORE require the memory operation.
The resources (hardware) required by each stage should not overlap with those of the other stages; otherwise duplicate resources are needed:
  A dedicated incrementer for the PC instead of using the ALU.
  Separate read ports for data and instructions (IF vs MEM).
  Multiple internal buses instead of a single bus.
For an ideal pipeline, although each instruction still needs to go through all 5 stages (i.e. 5 clock cycles), the throughput is 1 instruction per clock cycle.
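The throughput claim can be checked with a short calculation. The sketch assumes an ideal pipeline without stalls, in which the first instruction takes as many cycles as there are stages and every later instruction completes one cycle after the previous one; the function names are illustrative.

```python
def pipelined_cycles(n, stages=5):
    """Cycles to finish n instructions in an ideal pipeline."""
    return stages + n - 1


def unpipelined_cycles(n, stages=5):
    """Cycles when each instruction must finish before the next starts."""
    return stages * n


def throughput(n, stages=5):
    """Average instructions completed per clock cycle."""
    return n / pipelined_cycles(n, stages)


for n in (1, 10, 1000):
    print(n, unpipelined_cycles(n), pipelined_cycles(n), round(throughput(n), 3))
```

For a single instruction there is no gain, but as the number of instructions grows the throughput approaches 1 instruction per clock cycle.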
Pipeline Hazards
Pipeline hazards are situations that prevent the next instruction from entering the pipeline (for execution).
There are three classes of pipeline hazards:
1. Structural hazard: arises from a resource conflict. For example, if only one multiplier unit is available and the current and next instructions both require a multiply, the next instruction must wait for the previous multiply to finish. Likewise, when the PC increment and an ALU operation both require the ALU, there is a resource conflict.
2. Data hazard: an instruction is waiting for the result of a previous instruction (data dependency).
3. Control hazard: arises from branches or other instructions that change the PC (e.g. CALL/JSR, jump to subroutine). We don't know where to continue until the branch instruction finishes.
Structural Hazard
For example, if only one memory read port is available, then instruction fetch and memory read/write cannot be overlapped. (Note that not all instructions require a memory read/write.)
For LOAD/STORE, the IF of the instruction that overlaps with MEM needs to stop and wait.
Similarly, the increment of the PC cannot be overlapped with ALU operations if both require the ALU.
Solution: add more resources, e.g. a dedicated incrementer for the PC and separate data/instruction caches, so that the IF and MEM stages use different hardware.
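The IF/MEM conflict can be located with a small sketch. It assumes the 5-stage pipeline from the earlier slides, a single memory port, cycles numbered from 0, and a schedule in which no stall has been inserted yet; the function name and the list-of-opcodes program format are assumptions for illustration.

```python
def memory_port_conflicts(program):
    """Find cycles where a LD/ST in MEM collides with an instruction fetch.

    Instruction i is in IF during cycle i and in MEM during cycle i + 3.
    Returns (cycle, index_of_ld_st, index_of_fetched_instruction) tuples.
    """
    conflicts = []
    for i, op in enumerate(program):
        if op in ("LD", "ST"):
            mem_cycle = i + 3           # IF=i, ID=i+1, EX=i+2, MEM=i+3
            fetch_index = mem_cycle     # instruction fetched in that cycle
            if fetch_index < len(program):
                conflicts.append((mem_cycle, i, fetch_index))
    return conflicts


program = ["LD", "ADD", "SUB", "ADD", "ST"]
print(memory_port_conflicts(program))
```

With separate instruction and data caches the two accesses go to different hardware, so every conflict reported here disappears without a stall.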
Data Hazard
Example:
ADD R2, R3, R1
SUB R1, R5, R5
The source operand R1 of the SUB instruction depends on the result of the previous instruction.
See the figure on the next slide, where operand fetch is done at the same time as instruction decoding (operand pre-fetching).
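A dependency of this kind can be detected mechanically. The sketch below assumes the operand order used in the example (the destination register is the last operand) and treats a read as hazardous when the register was written by one of the two preceding instructions; that window of 2 is an assumption about the pipeline, not something the slides state.

```python
def parse(instr):
    """Split 'ADD R2, R3, R1' into (op, sources, dest); dest is the last operand."""
    op, operands = instr.split(maxsplit=1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[:-1], regs[-1]


def raw_hazards(program, window=2):
    """Return (producer, consumer, register) for reads that follow a write too closely."""
    parsed = [parse(instr) for instr in program]
    hazards = []
    for j, (_, sources, _) in enumerate(parsed):
        for i in range(max(0, j - window), j):
            dest = parsed[i][2]
            if dest in sources:
                hazards.append((i, j, dest))
    return hazards


print(raw_hazards(["ADD R2, R3, R1", "SUB R1, R5, R5"]))
```

The result names instruction 0 as the producer and instruction 1 as the consumer of R1, which is the dependency described above.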
Data Hazard
Solution to Data Hazard
Hardware techniques
  The result is already at the output of the ALU by the end of the EX cycle of the ADD instruction.
  We can forward the result to the ALU input for the EX of the next instruction.
  This is called data forwarding.
  There is still a problem with the LD instruction, where the data is available only at the MEM stage.
Software techniques
  Change the execution sequence by inserting unrelated instructions in between, e.g.

  Original order           Reordered
  ADD R11, R12, R13        ADD R2, R3, R1
  ADD R14, R15, R16        ADD R11, R12, R13
  ADD R2, R3, R1           ADD R14, R15, R16
  SUB R1, R5, R5           SUB R1, R5, R5
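The effect of both techniques can be expressed as the number of stall cycles a dependent instruction needs. The sketch assumes a classic 5-stage pipeline in which the register file is written in WB and can be read in ID during the same cycle; the required distances (3 without forwarding, 2 after a load with forwarding, 1 after an ALU instruction with forwarding) follow from that assumption rather than from the slides.

```python
def stall_cycles(distance, producer_is_load=False, forwarding=True):
    """Stall cycles for a consumer `distance` instructions after its producer.

    distance = 1 means the consumer immediately follows the producer.
    """
    if not forwarding:
        needed = 3      # wait until the producer reaches WB
    elif producer_is_load:
        needed = 2      # loaded data is available only after MEM
    else:
        needed = 1      # ALU result is forwarded straight from EX
    return max(0, needed - distance)


print(stall_cycles(1, forwarding=False))        # ADD then SUB, no forwarding
print(stall_cycles(3, forwarding=False))        # two unrelated instructions between
print(stall_cycles(1))                          # ADD then SUB, with forwarding
print(stall_cycles(1, producer_is_load=True))   # LD then use, with forwarding
```

Under these assumptions, reordering so that two unrelated instructions separate ADD and SUB removes the stall without forwarding, while a load followed immediately by its use still costs one cycle even with forwarding.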
Data Hazard from LD instruction
Control Hazard
The IF stage of the next instruction cannot start until the branch is resolved.
Without further measures, the CPU must simply wait.
To avoid waiting, one can use branch prediction:
  On a branch instruction, we continue execution along one of the two paths.
  The speculative execution can always be discarded as long as nothing has been changed (i.e. nothing has been written back to a register or memory).
  With no other information, we choose the correct path 50% of the time.
  In some situations, e.g. branches corresponding to a for-loop, the prediction is correct even more often (because in a for-loop, most of the time, we branch back).
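The for-loop case can be quantified with a short sketch. It assumes the loop is closed by a single backward branch that is taken on every iteration except the last; the function names are illustrative.

```python
def loop_branch_outcomes(iterations):
    """Outcomes of a loop-closing branch: taken every time except the last."""
    return [True] * (iterations - 1) + [False]


def accuracy(outcomes, predict_taken=True):
    """Fraction of branches for which a fixed (static) prediction is right."""
    correct = sum(1 for taken in outcomes if taken == predict_taken)
    return correct / len(outcomes)


outcomes = loop_branch_outcomes(10)
print(accuracy(outcomes, predict_taken=True))
print(accuracy(outcomes, predict_taken=False))
```

For a 10-iteration loop, always predicting "taken" is right 9 times out of 10, far better than the 50% expected from an uninformed guess.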
Control Hazard with Simple Branch Prediction
Dynamic Branch Prediction
The current prediction depends on the outcomes of previous predictions.
Rationale: in a for-loop, for example, we usually predict branch taken. For the last iteration the branch is not taken, but the next time the loop runs the branch is taken again.
Two consecutive wrong predictions are required to change the decision.
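One common way to obtain this behaviour is a 2-bit saturating counter. The slides do not say which scheme is meant, so the sketch below is an assumption: states 2 and 3 predict taken, states 0 and 1 predict not taken, and each outcome moves the counter one step.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0, 1 predict not taken; 2, 3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(predictor, outcomes):
    """Run the predictor over a list of actual outcomes and count the misses."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A 4-iteration loop executed twice: taken, taken, taken, not taken.
outcomes = [True, True, True, False] * 2
print(count_mispredictions(TwoBitPredictor(), outcomes))
```

The single not-taken outcome at each loop exit costs one misprediction but does not flip the prediction, so the first iteration of the next run of the loop is still predicted correctly.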
Delayed Branch
To avoid the control hazard, the instruction following the branch instruction is always executed.
The branch takes effect only after the following instruction has executed.
This is employed in the MIPS CPU.
The compiler first adds a NOP instruction after the branch, which is always correct.
Then, by instruction rescheduling, it tries to find a useful instruction to put in the delay slot.
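The delay-slot rule can be illustrated with a tiny interpreter. The tuple-based program format and the unconditional `BR` instruction are hypothetical simplifications for this sketch; only forward branches are used in the example.

```python
def run_with_delay_slot(program):
    """Return the indices of executed instructions, in order.

    program is a list of tuples; ("BR", target) is an unconditional branch
    to index `target`, anything else is treated as an ordinary instruction.
    """
    executed = []
    pc = 0
    pending_target = None       # set while a branch waits for its delay slot
    while pc < len(program):
        instr = program[pc]
        executed.append(pc)
        if pending_target is not None:
            # This instruction was the delay slot: now the branch takes effect.
            pc = pending_target
            pending_target = None
        else:
            if instr[0] == "BR":
                pending_target = instr[1]
            pc += 1
    return executed


program = [("ADD",), ("BR", 4), ("NOP",), ("SUB",), ("ST",)]
print(run_with_delay_slot(program))
```

Instruction 2 (the NOP in the delay slot) runs even though the branch skips to index 4; replacing that NOP with a useful, independent instruction is what instruction rescheduling aims for.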
Instruction Scheduling in Delayed Branch
Processor Performance
Execution Time = Instruction Count × CPI × Clock Cycle Time
CPI = clock cycles per instruction.
Clock rate is not the only factor when comparing two processors.
You also have to compare how many instructions are needed to achieve a function, and how many clock cycles an instruction requires.
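The formula can be applied directly. The two processors below are hypothetical, with numbers chosen only to show that a higher clock rate does not guarantee a shorter execution time.

```python
def execution_time(instruction_count, cpi, clock_cycle_time):
    """Execution Time = Instruction Count x CPI x Clock Cycle Time."""
    return instruction_count * cpi * clock_cycle_time


# Times in nanoseconds. Processor A: 2 GHz clock (0.5 ns cycle), CPI of 4.
# Processor B: 1 GHz clock (1.0 ns cycle), CPI of 1.5. Same program size.
time_a = execution_time(1_000_000, 4, 0.5)
time_b = execution_time(1_000_000, 1.5, 1.0)
print(time_a, time_b)
```

Processor B finishes first despite running at half the clock rate, because its lower CPI more than compensates.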
Processor Performance
Exploit more effective instruction pipelining.
Use multiple instruction execution units.
Large register file: reduces memory accesses, hence increasing speed.
Simplified instruction set: reduces or eliminates the need for microprograms, which are slower than a hardwired implementation.
Characteristics of Modern Processors
All instructions are register-register type except LOAD and STORE (which access memory).
Fixed-length, simple, fixed-format instructions that do not cross main memory word boundaries. For example, we can start fetching register operands if they are always at the same place within the instruction, and discard the read if there are no operands.
Relatively few operations and addressing modes: a simple CPU means a simpler implementation and a faster clock rate. The pipeline can also be designed more efficiently.
Use of hardwired rather than microprogrammed control.
Use of instruction pipelining and extensive software and hardware techniques to eliminate pipeline stalls (pipeline hazards).
These are the concepts of Reduced Instruction Set Computers (RISC).
Performance Improvement by using Registers
ADD A, B, C             LD  A, R1
                        LD  B, R2
                        ADD R1, R2, R3
                        ST  R3, C

Number of memory accesses = 3 (in both cases)

ADD A, B, C             LD  A, R1
ADD D, C, F             LD  B, R2
ADD B, F, C             ADD R1, R2, R3
                        LD  D, R4
                        ADD R4, R3, R5
                        ADD R2, R5, R3
                        ST  R3, C
                        ST  R5, F

Number of memory accesses: 9 vs 5.
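The access counts can be reproduced by counting operands that are not registers. The sketch assumes that register names have the form R followed by digits and that every other operand is a memory location; instruction fetches are not counted, as in the slide.

```python
import re


def memory_accesses(program):
    """Count operand accesses to memory (any operand that is not R<number>)."""
    count = 0
    for instr in program:
        _, operands = instr.split(maxsplit=1)
        for operand in operands.split(","):
            if not re.fullmatch(r"R\d+", operand.strip()):
                count += 1
    return count


memory_to_memory = ["ADD A, B, C", "ADD D, C, F", "ADD B, F, C"]
load_store = [
    "LD A, R1", "LD B, R2", "ADD R1, R2, R3",
    "LD D, R4", "ADD R4, R3, R5", "ADD R2, R5, R3",
    "ST R3, C", "ST R5, F",
]
print(memory_accesses(memory_to_memory), memory_accesses(load_store))
```

The saving comes from reuse: B and the intermediate values of C and F stay in registers instead of being reloaded from memory for each instruction.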
Performance Issue
Simple instruction set and hardwired logic ⇒ high clock rate (low clock cycle time)
Extensive pipelining ⇒ low CPI
Simple instructions ⇒ high instruction count
However, empirical studies show that the increase in instruction count is usually very small, e.g. 20%, even when using simple instructions.
Hence there is an overall improvement in performance.
Reduced Instruction Set Computer (RISC) Background
Advances in Computer Technology
  Fully utilize the large number of registers.
  Rearrangement of instructions to reduce pipeline stalls.
Standardized Operating System and Environments
  Unix/Linux operating system with source code.
  X-windows as standard GUI.
  Can afford to remove compatibility with older instruction sets.
  Can easily port the OS, GUI and application software.
Relative Code Size
Three studies compared the number of instructions executed in different sets of C programs on different processors, including the RISC I processor (reduced instruction set computer), against complex instruction set processors (VAX-11, Motorola M68000, and Zilog Z8002).

Processors    11 C Prog   12 C Prog   5 C Prog
RISC I        1.0         1.0         1.0
VAX 11/780    0.8         0.67
M68000        0.9         0.9
Z8002         1.2         1.12

The instruction count only increases by 20% to 30%.
CPI is reduced by a factor of 2 to 4 (e.g. the VAX-11/780 has a CPI of 8).
Clock cycle time is also reduced.
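Combining these figures with the execution time formula gives a rough estimate of the net effect. The sketch uses the ranges quoted above (instruction count up by 20% to 30%, CPI down by a factor of 2 to 4) and conservatively assumes an unchanged clock cycle time; the function name is illustrative.

```python
def relative_time(instr_ratio, cpi_ratio, cycle_ratio=1.0):
    """Execution time relative to the baseline (1.0 means no change)."""
    return instr_ratio * cpi_ratio * cycle_ratio


# Least favourable case: 30% more instructions, CPI only halved.
least_favourable = relative_time(1.3, 1 / 2)
# Most favourable case: 20% more instructions, CPI reduced to a quarter.
most_favourable = relative_time(1.2, 1 / 4)
print(round(least_favourable, 2), round(most_favourable, 2))
```

Even in the least favourable case the relative execution time is below 1, and any reduction in clock cycle time would lower it further.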
Characteristics of RISC
All instructions are register-register type except LOAD and STORE (which access memory).
Fixed-length, simple, fixed-format instructions that do not cross main memory word boundaries.
Relatively few operations and addressing modes.
Use of hardwired rather than microprogrammed control.
Use of instruction pipelining and extensive software and hardware techniques to eliminate pipeline disruption.
Reliance on an optimizing compiler to enhance system performance.
RISC is a design philosophy for enhancing CPU performance.
Nowadays we usually see some kind of hybrid design.