CSIS1120A - 14. Processor Organization
i.cs.hku.hk/~kpchan/CSIS1120A/Notes/14.processor.pdf


Processor Organization

The processor carries out instructions using two kinds of operations:
1. moving data from one place to another
2. transforming data through the ALU.

The data are stored in registers inside the CPU.
These operations are controlled by "control signals" generated by the control unit. (These correspond to the flags set in the instruction decoder function of the simulator in Assignment 2/4.)
Data movement can be described as follows, e.g. instruction fetch:

MAR ← PC

PC ← PC+4


Processors

Typical internal structure of CPU


Registers

Registers are divided into 2 types:
- User-visible registers: general purpose registers and other registers such as index registers, base pointers, etc.
- Hidden registers: PC, IR, MAR, MBR, TEMP, PSW (processor status word, or flag register), etc.


Control Signals

Data movement is controlled by control signals:

e.g. MAR ← PC

This operation requires 2 control signals:
1. A control signal to tell the PC to put its content onto the CPU bus (in terms of digital logic circuits: to enable the output driver of PC).
2. A control signal to tell the MAR to get the value from the CPU bus (in terms of digital logic circuits: to clock the latch so that it stores its input into the register).

Other operations may require more control signals, e.g. control signals to tell the ALU what operation to perform, control signals to read/write external memory, etc.
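The two control signals behind MAR ← PC can be sketched in a few lines of Python. This is only an illustrative model: the `Register` and `Bus` classes and the `transfer` function are hypothetical names, not part of the course simulator.

```python
class Register:
    """A register attached to a shared CPU bus (hypothetical model)."""
    def __init__(self, name, value=0):
        self.name = name
        self.value = value


class Bus:
    """A single internal bus; None means no register is driving it."""
    def __init__(self):
        self.value = None


def transfer(bus, src, dst):
    """Model dst <- src as two control signals on one bus."""
    # Control signal 1: enable the output driver of the source register.
    bus.value = src.value
    # Control signal 2: clock the destination latch so it stores the bus value.
    dst.value = bus.value
    # The bus is released at the end of the cycle.
    bus.value = None


pc = Register("PC", 0x100)
mar = Register("MAR")
bus = Bus()
transfer(bus, pc, mar)      # MAR <- PC
print(hex(mar.value))
```

Because only one register may drive a single bus at a time, each such transfer occupies the bus for a whole cycle in this model.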


Hardwired Control vs MicroProgrammed Control

We have two ways to generate the control signals.

Hardwired Control
- The control signals are generated by logic gates, e.g. write down the truth table and implement it with logic gates.
- It is faster, but the design is more complicated (nowadays CAD tools are used).

Microprogrammed Control
- The control signals are stored in memory as microcode.
- The entire truth table is stored in memory, and the inputs are used as the memory addresses.
- Even though the control store is inside the CPU, a memory access is still slower than logic gates.
- However, the design and implementation are simple, and it is easy to modify/debug.

Modern processors tend to use hardwired control; by simplifying the instruction set, we can make the logic simpler.


Instruction Execution Cycle

There are 3 stages in an instruction execution cycle:
1. Instruction Fetch
2. Instruction Decode
3. Instruction Execution:
   - Read operands from registers
   - ALU operations
   - Memory reference
   - Branch completion
   - Write back to registers

They are similar to the content of the simulator in the Assignment.


Example

ADD A, B, C

where A, B, C use absolute (direct) addressing, i.e. the addresses are in the words following the instruction. This is a 4-word instruction.

Instruction Fetch

IF: MAR ← PC
    PC ← PC+4
    IR ← mem[MAR]


Example: ADD A, B, C

Instruction Execute

OF: MAR ← PC            # operand fetch
    PC ← PC+4
    MBR ← mem[MAR]      # addr of A in MBR
    MAR ← MBR
    MBR ← mem[MAR]      # value of A in MBR
    ALU.input1 ← MBR

    MAR ← PC
    PC ← PC+4
    MBR ← mem[MAR]      # addr of B in MBR
    MAR ← MBR
    MBR ← mem[MAR]      # value of B in MBR
    ALU.input2 ← MBR


Example: ADD A, B, C

Exec: ALU.output ← ALU.input1 + ALU.input2

WB: MAR ← PC            # write back result
    PC ← PC+4
    MBR ← mem[MAR]      # addr of C in MBR
    MAR ← MBR
    MBR ← ALU.output    # value of C in MBR
    mem[MAR] ← MBR      # memory write
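The micro-operations for ADD A, B, C can be traced with a small Python sketch. The memory layout (a dictionary keyed by byte address, with the string "ADD" standing in for the opcode word) and the function name `execute_add` are assumptions made for illustration only.

```python
def execute_add(mem, pc):
    """Run one 4-word 'ADD A, B, C' located at address pc; return the new PC."""
    # IF: fetch the instruction word.
    mar = pc
    pc += 4
    ir = mem[mar]
    if ir != "ADD":
        raise ValueError("this sketch only understands ADD")

    # OF: the next two words hold the addresses of A and B.
    alu_inputs = []
    for _ in range(2):
        mar = pc
        pc += 4
        mbr = mem[mar]       # address of the operand in MBR
        mar = mbr
        mbr = mem[mar]       # value of the operand in MBR
        alu_inputs.append(mbr)

    # Exec: the ALU adds its two inputs.
    alu_output = alu_inputs[0] + alu_inputs[1]

    # WB: the last word holds the address of C.
    mar = pc
    pc += 4
    mbr = mem[mar]           # address of C in MBR
    mar = mbr
    mbr = alu_output         # value of C in MBR
    mem[mar] = mbr           # memory write
    return pc


# Instruction at address 0; A, B, C live at addresses 100, 104, 108.
mem = {0: "ADD", 4: 100, 8: 104, 12: 108, 100: 7, 104: 5, 108: 0}
new_pc = execute_add(mem, 0)
print(new_pc, mem[108])
```

After the call, the PC has advanced past all four words of the instruction and the sum of A and B is stored at the address of C.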


Instruction Pipeline

Pipelining is used to increase the throughput of instruction execution.
We assume only the LD/ST instructions use memory; all other instructions use registers.
Divide the execution process into different stages, for example:

1. Instruction Fetch (IF)
2. Instruction Decode (ID) [operand fetch from register]
3. Instruction Execution (EX)
4. Memory Operation (MEM)
5. Write back (register files) (WB)






Pipeline

Note that not all instructions require all stages, e.g. only LOAD and STORE require the memory operation.
The resources (hardware) required by each stage should not overlap with each other; otherwise duplicate resources are needed:
- a dedicated incrementer for PC instead of using the ALU
- separate read ports for data and instructions (IF vs MEM)
- multiple internal buses instead of a single bus.

For an ideal pipeline, although each instruction still needs to go through all 5 stages (i.e. 5 clock cycles), the throughput is 1 instruction per clock cycle.
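The throughput claim can be checked with a short sketch. It assumes an ideal pipeline with no hazards, where a new instruction enters every cycle; the function names are made up for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed to finish n instructions in an ideal (stall-free) pipeline."""
    if n_instructions == 0:
        return 0
    # The first instruction takes n_stages cycles; each later one finishes
    # one cycle after its predecessor.
    return n_stages + n_instructions - 1


def diagram(n_instructions):
    """One text row per instruction; each 4-character column is one clock cycle."""
    rows = []
    for i in range(n_instructions):
        cells = ["    "] * i + [f"{stage:<4}" for stage in STAGES]
        rows.append("".join(cells).rstrip())
    return "\n".join(rows)


print(diagram(3))
for n in (1, 10, 1000):
    cycles = total_cycles(n)
    print(n, cycles, round(n / cycles, 3))   # throughput in instructions/cycle
```

As the number of instructions grows, the 4 cycles spent filling the pipeline become negligible and the throughput approaches 1 instruction per clock cycle.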


Pipeline Hazards

Pipeline hazards are situations that prevent the next instruction from entering the pipeline (for execution).
Three classes of pipeline hazards:

1. Structural hazard: arises from a resource conflict. E.g. if only one multiplier unit is available, and the current and next instructions both require a multiply, the next instruction must wait for the previous multiply to finish; or when the PC increment and an ALU operation both require the ALU, there will be a resource conflict.
2. Data hazard: an instruction is waiting for the result of previous instructions (data dependency).
3. Control hazard: arises from branches or other instructions that change the PC (e.g. CALL/JSR, jump to subroutine). We don't know where to continue until the branch instruction finishes.


Structural Hazard

For example, if only one memory read port is available, then instruction fetch and memory read/write cannot be overlapped. (Note that not all instructions require memory read/write.)
For LOAD/STORE, the IF of the instruction that overlaps with MEM needs to stop and wait.
Or, the increment of PC cannot be overlapped with ALU operations if both require the ALU.
Solution: add more resources, e.g. a dedicated incrementer for PC, or separate data/instruction caches, so that the IF and MEM stages use different hardware.


Structural Hazard


Data Hazard

Example:
ADD R2, R3, R1
SUB R1, R5, R5

The source operand R1 of the SUB instruction depends on the result of the previous instruction.
See the figure on the next page, where operand fetch is done at the same time as instruction decoding (operand pre-fetching).
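A read-after-write dependency like the one between ADD and SUB above can be detected mechanically. The sketch below assumes the operand order used in these notes (sources first, destination last); `parse` and `raw_hazard` are hypothetical helper names.

```python
def parse(instr):
    """Split 'OP src1, src2, dst' into (op, [sources], destination)."""
    op, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[:-1], regs[-1]


def raw_hazard(first, second):
    """True if `second` reads a register that `first` writes."""
    _, _, dst = parse(first)
    _, sources, _ = parse(second)
    return dst in sources


print(raw_hazard("ADD R2, R3, R1", "SUB R1, R5, R5"))      # dependent pair
print(raw_hazard("ADD R2, R3, R1", "ADD R11, R12, R13"))   # unrelated pair
```

A check of this kind is what a compiler or a hardware interlock needs before deciding whether two adjacent instructions can overlap in the pipeline.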


Data Hazard


Solution to Data Hazard

By hardware techniques:
- The result is already at the output of the ALU by the end of the EX cycle of the ADD instruction.
- We can forward the result to the ALU input for the EX of the next instruction.
- This is called data forwarding.
- There is still a problem with the LD instruction, where the data is available only at the MEM stage.

By software techniques:
- Change the execution sequence by inserting unrelated instructions in between, e.g.

Original             Rescheduled
ADD R11, R12, R13    ADD R2, R3, R1
ADD R14, R15, R16    ADD R11, R12, R13
ADD R2, R3, R1       ADD R14, R15, R16
SUB R1, R5, R5       SUB R1, R5, R5


Data Hazard from LD instruction


Control Hazard

The IF stage of the next instruction cannot start until the branch is resolved.
The CPU must wait, and there is no other way to resolve the branch.
To avoid waiting, one can use Branch Prediction:
- On a branch instruction, we continue execution along one of the two paths.
- The execution can always be discarded if nothing has been changed (i.e. nothing has been written back to register/memory).
- If the path is chosen arbitrarily, we will choose the correct path about 50% of the time.
- In some situations, e.g. branches corresponding to a for-loop, we will have an even higher rate of correct prediction (because in a for-loop, most of the time, we branch back).


Control Hazard


Control Hazard with Simple Branch Prediction


Dynamic Branch Prediction

The current prediction depends on how the previous predictions turned out.
Rationale: for example, in a for-loop, we will usually predict branch taken. However, for the last iteration, the branch is not taken. The next time the loop runs, the branch is taken again.
Two consecutive wrong predictions are required to change the decision.
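One common way to realize "two consecutive wrong predictions to change the decision" is a 2-bit saturating counter. The notes do not name a specific scheme, so the sketch below is an assumption; the class name and the loop pattern (a 4-iteration loop run three times) are invented for illustration.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=3):
        self.state = state          # start in 'strongly taken'

    def predict(self):
        return self.state >= 2      # True means 'predict taken'

    def update(self, taken):
        # Move one step towards the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes, predictor):
    """Number of correct predictions over a sequence of branch outcomes."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct


# Loop branch: taken three times, then not taken on exit; the loop runs 3 times.
outcomes = [True, True, True, False] * 3
correct = count_correct(outcomes, TwoBitPredictor())
print(correct, "of", len(outcomes))
```

In this model the single not-taken outcome at each loop exit only weakens the prediction, so the branch is still predicted taken when the loop is entered again.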


Delayed Branch

To avoid the control hazard, the instruction following the branch instruction is always executed.
The branch takes effect only after the following instruction has executed.
This is employed in the MIPS CPU.
The compiler first adds a NOP instruction after the branch, which is always correct.
Then, by instruction rescheduling, it tries to find a useful instruction to put in the delay slot.


Delayed Branch


Instruction Scheduling in Delayed Branch


Processor Performance

Execution Time = Instruction Count × CPI × Clock Cycle Time
CPI = clock cycles per instruction.
Clock rate is not the only factor when comparing two processors.
You have to compare how many instructions are needed to achieve a function, and how many clock cycles an instruction requires.
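A quick calculation shows why clock rate alone is misleading. The two processors and all of their figures below are made up purely for illustration.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """Execution time in seconds; clock cycle time is 1 / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical processor A: faster clock, but more cycles per instruction.
time_a = execution_time(1_000_000, cpi=2.0, clock_rate_hz=2.0e9)
# Hypothetical processor B: slower clock and more instructions, but lower CPI.
time_b = execution_time(1_200_000, cpi=1.2, clock_rate_hz=1.5e9)

print(f"A: {time_a * 1e3:.2f} ms, B: {time_b * 1e3:.2f} ms")
```

With these assumed numbers, processor B finishes slightly sooner even though its clock is slower and it executes 20% more instructions.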


Processor Performance

- Exploit more effective instruction pipelining.
- Use multiple instruction execution units.
- Large register file: reduces memory access, hence increasing speed.
- Simplified instruction set: reduces or eliminates the need for microprograms, which are slower than a hardwired implementation.


Characteristics of Modern Processors

- All instructions are register-register type except LOAD and STORE (which access memory).
- Fixed-length, simple, fixed-format instructions that do not cross main memory word boundaries. For example, we can start fetching register operands if they are always at the same place within the instruction, and discard the read if there are no operands.
- Relatively few operations and addressing modes: a simple CPU, simpler CPU implementation, and faster clock rate. Also, the pipeline can be designed more efficiently.
- Use of hardwired rather than microprogrammed control.
- Use of instruction pipelining and extensive software and hardware techniques to eliminate pipeline stalls (pipeline hazards).

These are concepts of Reduced Instruction Set Computers (RISC).


Performance Improvement by using Registers

ADD A, B, C          LD A, R1
                     LD B, R2
                     ADD R1, R2, R3
                     ST R3, C

Number of memory accesses = 3 (in both cases)

ADD A, B, C          LD A, R1
ADD D, C, F          LD B, R2
ADD B, F, C          ADD R1, R2, R3
                     LD D, R4
                     ADD R4, R3, R5
                     ADD R2, R5, R3
                     ST R3, C
                     ST R5, F

Number of memory accesses is 9 vs 5.
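The 9 vs 5 count can be reproduced by counting every operand that is not a register. The sketch assumes that register names are an R followed by digits and that any other operand is a memory location; it counts data accesses only, not instruction fetches.

```python
def is_register(operand):
    """Registers are assumed to be named R<number>; anything else is memory."""
    return operand.startswith("R") and operand[1:].isdigit()


def memory_accesses(program):
    """Count data memory accesses: one per memory operand."""
    count = 0
    for line in program:
        _, operands = line.split(None, 1)
        for operand in operands.split(","):
            if not is_register(operand.strip()):
                count += 1
    return count


memory_version = ["ADD A, B, C", "ADD D, C, F", "ADD B, F, C"]
register_version = [
    "LD A, R1", "LD B, R2", "ADD R1, R2, R3",
    "LD D, R4", "ADD R4, R3, R5", "ADD R2, R5, R3",
    "ST R3, C", "ST R5, F",
]
print(memory_accesses(memory_version), memory_accesses(register_version))
```

The saving comes from reuse: B, C and F are each touched more than once in the memory version, but in the register version each value is loaded or stored only once.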

Performance Issue

- Simple instruction set and hardwired logic ⇒ high clock rate (low clock cycle time)
- Extensive pipelining ⇒ low CPI
- Simple instructions ⇒ high instruction count
- However, empirical studies show that the increase in instruction count is usually very small, e.g. 20%, even using simple instructions.
- Hence there is an overall improvement in performance.


Reduced Instruction Set Computer (RISC): Background

Advances in Computer Technology
- fully utilize the large number of registers.
- rearrangement of instructions to reduce pipeline stalls.

Standardized Operating System and Environments
- Unix/Linux operating system with source code
- X-windows as the standard GUI
- can afford to remove compatibility with older instruction sets; can easily port OS, GUI and application software.


Relative Code Size

Three studies were performed to compare the number of instructions executed in different sets of C programs for different processors, including the RISC I processor (reduced instruction set computer), against complex instruction set processors (VAX-11, Motorola M68000, and Zilog Z8002).

Processors  11 C Prog  12 C Prog  5 C Prog
RISC I  1.0  1.0  1.0
VAX 11/780  0.8  0.67
M68000  0.9  0.9
Z8002  1.2  1.12

- The instruction count only increases by 20% to 30%.
- CPI is reduced by 2 to 4 times (e.g. the VAX-11/780 has a CPI of 8).
- Clock cycle time is also reduced.
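Plugging these ranges into Execution Time = Instruction Count × CPI × Clock Cycle Time gives a rough idea of the net effect. The sketch conservatively assumes the clock cycle time is unchanged (ratio 1.0), which is an assumption of this example rather than a figure from the studies.

```python
def relative_time(ic_ratio, cpi_ratio, cycle_ratio=1.0):
    """Execution time of the RISC design relative to the CISC design."""
    return ic_ratio * cpi_ratio * cycle_ratio


# Least favourable case: 30% more instructions, CPI only halved.
worst = relative_time(ic_ratio=1.3, cpi_ratio=1 / 2)
# Most favourable case: 20% more instructions, CPI reduced 4 times.
best = relative_time(ic_ratio=1.2, cpi_ratio=1 / 4)

print(f"relative time between {best:.2f} and {worst:.2f}")
print(f"speedup between {1 / worst:.2f}x and {1 / best:.2f}x")
```

Even in the least favourable case the extra instructions are outweighed by the lower CPI, and any reduction in clock cycle time improves the result further.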


Characteristics of RISC

- All instructions are register-register type except LOAD and STORE (which access memory).
- Fixed-length, simple, fixed-format instructions that do not cross main memory word boundaries.
- Relatively few operations and addressing modes.
- Use of hardwired rather than microprogrammed control.
- Use of instruction pipelining and extensive software and hardware techniques to eliminate pipeline disruption.
- Reliance on an optimizing compiler to enhance system performance.

RISC is a design philosophy for enhancing the performance of the CPU.
Nowadays, we usually see some kind of hybrid design.

