PROCESSOR ARCHITECTURE

Jehan-François Pâris
jparis@uh.edu

Chapter Organization

• Logic design conventions
• Implementation of a "toy" CPU
• Pipelining
• Pipelining hazards
  – Data hazards
  – Control hazards
• Exceptions
• Parallelism

IMPORTANT

LOGIC DESIGN CONVENTIONS

Combinational/state elements

• Combinational elements:
  – Outputs only depend on current inputs
  – Stateless
    • Adders and, more generally, arithmetic logic units (ALUs)

Combinational/state elements

• State elements:
  – Have a memory holding a state
  – Output depends on current inputs and state of element
  – State reflects past inputs
    • Flip-flops, …

Judicial analogy

• In our legal system
  – Guilty/not guilty decision is stateless
    • Good reasons
  – Sentencing decision is not
    • "Three strikes and you are out" laws
    • Good reasons

Clocking methodology

• We will assume an edge-triggered clocking technology
  – Edge is short enough to prevent data propagation in state elements
  – Can read current state of a memory element at the same time we update it

Clocking convention

• Omit write control signal if state element is updated at every active clock edge

A "TOY" CPU

Motivation

• "Toy" CPU will implement a subset of MIPS instruction set

• Subset will be
  – Self-sufficient
  – Simpler to implement
  – Complex enough to allow a serious discussion of CPU architecture

The subset

• Will include
  – Load and store instructions: lw (load word) and sw (store word)
  – Arithmetic-logic instructions: add, sub, and, or and slt (set less than)
  – Branch instructions: beq (branch if equal) and j (jump)

Load and store instructions

• Format I
• Three operands:
  – Two registers $r1 and $r2
  – One displacement d

• lw $r1, d($r2) loads into register $r1 main memory word at address contents($r2) + d

• sw $r1, d($r2) stores contents of register $r1 into main memory word at address contents($r2) + d

Arithmetic-logic instructions

• Format R
• Three operands:
  – Three registers $r1, $r2 and $r3
• Store into register $r1 result of $r2 <op> $r3, where <op> can be add, subtract, and, or, as well as set if less than

Branch instruction

• Format I
• Three operands:
  – Two registers $r1 and $r2
  – One displacement d
• beq $r1, $r2, d sets value of PC to PC + 4 + 4×d if $r1 = $r2

The simplest data path

• Assume CPU will do nothing but
  – Increment its program counter and
  – Deliver the next instruction

The simplest data path

[Figure: instruction memory with a "Read address" input and an "Instruction" output]

Implementing R2R instructions

• The ALU takes two 32-bit inputs
• Returns
  – A 32-bit output
  – A 1-bit signal if the result is zero

The register file

• Two read outputs that are always available
• One write input activated by a RegWrite signal

• Three register selectors

The register file

Read select 1 Read data 1

Read select 2 Read data 2

Write select Write data

RegWrite: enables register writes
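The register file described above can be sketched in a few lines of Python. This is only an illustrative model, not the actual circuit: the class name `RegisterFile`, the method names, and the choice of 32 registers are assumptions made for this example.

```python
class RegisterFile:
    """Toy model: two read ports, one write port gated by RegWrite."""

    def __init__(self, size=32):
        # 'size' is an assumption for this sketch (MIPS has 32 registers).
        self.regs = [0] * size

    def read(self, read_sel1, read_sel2):
        # Both read outputs are always available; no control signal is needed.
        return self.regs[read_sel1], self.regs[read_sel2]

    def clock_edge(self, write_sel, write_data, reg_write):
        # The write happens at the active clock edge, and only
        # when the RegWrite control signal is asserted.
        if reg_write:
            self.regs[write_sel] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(write_sel=8, write_data=5, reg_write=True)   # register 8 is updated
rf.clock_edge(write_sel=9, write_data=7, reg_write=False)  # ignored: RegWrite is off
print(rf.read(8, 9))
```

The three register selectors of the slide correspond to `read_sel1`, `read_sel2` and `write_sel`.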

Implementing R2R instructions

[Figure: datapath for R2R instructions, showing the register file and the result; RegWrite is enabled]

Implementing load and store

• Require
  – An address calculation:
    • contents($r2) + d
  – An access to data memory
• Before doing the address calculation, we must transform 16-bit displacement d into a 32-bit value using sign extension

The data memory

• One address selector
• One write data input
• One read data output
• Two controls
  – MemWrite
  – MemRead

Sign extension (I)

• If 16-bit number has a zero as MSB
  – It is positive
  – Must add 16 zero bits on the left

0110 1010 1010 0100

0000 0000 0000 0000 0110 1010 1010 0100

Sign extension (II)

• If 16-bit number has a one as MSB
  – It is negative
  – Must add 16 one bits on the left

1110 1010 1010 0100

1111 1111 1111 1111 1110 1010 1010 0100
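The two sign-extension cases can be checked with a short Python sketch. The function name `sign_extend16` is a hypothetical helper introduced for this example only.

```python
def sign_extend16(value: int) -> int:
    """Extend a 16-bit two's complement value to 32 bits."""
    value &= 0xFFFF                  # keep only the 16 low-order bits
    if value & 0x8000:               # MSB is one: the number is negative
        return value | 0xFFFF0000    # fill the 16 upper bits with ones
    return value                     # MSB is zero: upper bits stay zero


# The two bit patterns used on the slides.
print(f"{sign_extend16(0b0110101010100100):032b}")
print(f"{sign_extend16(0b1110101010100100):032b}")
```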

The data memory

Memory address Read data

Write data

MemRead: enables memory reads

MemWrite: enables memory writes

Implementing the store instruction

[Figure: store datapath, showing the register file, the sign extension (SE) of the d field, the ALU, and the data memory address]

Implementing the load instruction

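[Figure: load datapath, showing the register file, the sign extension (SE) of the d field, the ALU, and the data memory address and read output]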

Implementing conditional branch

• Target address:
  – Sign-extend 16-bit immediate part of instruction
  – Shift left 2
  – Add to PC + 4
• Branch control logic:
  – Perform test operation on two registers
  – Check result
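The target address computation and the branch test can be sketched as follows. This is a minimal Python model under the assumption that the test is a subtraction whose zero result signals equality; the names `branch_target` and `next_pc` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def branch_target(pc: int, d: int) -> int:
    """Sign-extend the 16-bit field d, shift it left by 2, add it to PC + 4."""
    d &= 0xFFFF
    offset = d - 0x10000 if d & 0x8000 else d   # sign extension
    return (pc + 4 + (offset << 2)) & MASK32    # shift left 2 = multiply by 4


def next_pc(pc: int, d: int, r1: int, r2: int) -> int:
    """beq: take the branch only when the two register values are equal."""
    zero = ((r1 - r2) & MASK32) == 0            # zero result of the subtraction
    return branch_target(pc, d) if zero else (pc + 4) & MASK32


print(hex(next_pc(0x1000, 40, 5, 5)))   # branch taken
print(hex(next_pc(0x1000, 40, 5, 6)))   # branch not taken
```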

Implementing conditional branch

[Figure: branch datapath. The register file feeds the branch control logic; the d field of the instruction is sign-extended (SE), shifted left 2 and added to obtain the branch destination]

• Arithmetic-logic operations only use
  – Register file and ALU
• Load and store use
  – ALU for computing memory address
  – Data memory

Implementing other instructions

Combining everything

Left to be done

• All control signals:
  – Two multiplexers: ALUSrc and MemtoReg
  – RegWrite, MemRead and MemWrite switches
  – ALU controls (4 bits)

ALU control signals

ALU control lines   Function
0000                and
0001                or
0010                add
0110                subtract
0111                set on less than
1100                nor (not in "toy" subset)
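A small behavioral model of an ALU driven by these four control lines might look like the Python sketch below. The function name `alu` is hypothetical, and treating "set on less than" as a signed comparison is an assumption based on the usual MIPS definition of slt.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement number."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control: int, a: int, b: int):
    """Return (32-bit result, zero bit) for the given 4-bit ALU control value."""
    a &= MASK32
    b &= MASK32
    if control == 0b0000:
        result = a & b
    elif control == 0b0001:
        result = a | b
    elif control == 0b0010:
        result = (a + b) & MASK32
    elif control == 0b0110:
        result = (a - b) & MASK32
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0
    elif control == 0b1100:
        result = ~(a | b) & MASK32
    else:
        raise ValueError(f"unsupported ALU control value {control:04b}")
    return result, int(result == 0)


print(alu(0b0010, 2, 3))   # add
print(alu(0b0110, 7, 7))   # subtract, zero bit is set
```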

Controlling the ALU

• Recall that all R-format instructions have same opcode
  – Operation performed by ALU is specified in the function field (bits <0:5>)

Controlling the ALU

• ALU control inputs generated by two-step process
  – Construct two ALUOp control bits from opcode
  – Construct four ALU control bits using
    • Two ALUOp bits
    • Six bits from function field when they are needed

Dependence table

Opcode   ALUOp   Operation   Function   Action     ALU Ctl
lw       00      lw          -          add        0010
sw       00      sw          -          add        0010
beq      01      beq         -          subtract   0110
R-type   10      add         100000     add        0010
R-type   10      subtract    100010     subtract   0110
R-type   10      and         100100     and        0000
R-type   10      or          100101     or         0001
R-type   10      slt         101010     slt        0111

• Two-step process simplifies combinational logic
• Many don't care conditions in truth table

Truth table

ALUOp1  ALUOp0  F5  F4  F3  F2  F1  F0  ALU control bits
0       0       X   X   X   X   X   X   0010
0       1       X   X   X   X   X   X   0110
1       0       X   X   0   0   0   0   0010
1       X       X   X   0   0   1   0   0110
1       0       X   X   0   1   0   0   0000
1       0       X   X   0   1   0   1   0001
1       X       X   X   1   0   1   0   0111

• Bits 4 and 5 of function field are not used
• ALUOp bits only have three possible values: 00, 01 and 10
  – Introduces don't care conditions
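The two-step generation of the ALU control bits can be sketched in Python. The function name `alu_control` and the lookup table `LOW_FUNCT_TO_ALU` are hypothetical; the sketch keys on the four low-order bits of the function field because bits 4 and 5 are not used.

```python
# Four low-order function bits -> 4-bit ALU control value (R-type only).
LOW_FUNCT_TO_ALU = {
    0b0000: 0b0010,  # add
    0b0010: 0b0110,  # subtract
    0b0100: 0b0000,  # and
    0b0101: 0b0001,  # or
    0b1010: 0b0111,  # set on less than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Second step: derive the ALU control bits from ALUOp and the function field."""
    if alu_op == 0b00:          # lw and sw: add to compute the address
        return 0b0010
    if alu_op == 0b01:          # beq: subtract to compare the registers
        return 0b0110
    if alu_op == 0b10:          # R-type: look at the function field
        return LOW_FUNCT_TO_ALU[funct & 0b1111]
    raise ValueError("ALUOp only takes the values 00, 01 and 10")


print(f"{alu_control(0b10, 0b100010):04b}")   # R-type subtract
print(f"{alu_control(0b00):04b}")             # lw or sw
```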

• All R instructions use same data paths
  – Other control bits depend only on opcode

Control signal effects

Signal     When deasserted                                         When asserted
RegDst     Destination register comes from rt field (bits 20:16)   Destination register comes from rd field (bits 15:11)
RegWrite   None                                                    Enables write into destination register
ALUSrc     Second ALU operand comes from second register output    Second ALU operand comes from sign-extended displacement (bits 15:0)
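To make the bit ranges concrete, here is a hedged Python sketch that extracts the fields of a 32-bit instruction word, assuming the standard MIPS layout (opcode in bits 31:26, rs in 25:21, rt in 20:16, rd in 15:11, function field in 5:0, displacement in 15:0). The names `decode_fields` and `dest_register` are hypothetical.

```python
def decode_fields(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31:26
        "rs":     (word >> 21) & 0x1F,   # bits 25:21
        "rt":     (word >> 16) & 0x1F,   # bits 20:16
        "rd":     (word >> 11) & 0x1F,   # bits 15:11
        "funct":  word & 0x3F,           # bits 5:0
        "imm":    word & 0xFFFF,         # bits 15:0 (displacement)
    }


def dest_register(word: int, reg_dst: int) -> int:
    """Multiplexer driven by the destination control signal: rd if asserted, else rt."""
    fields = decode_fields(word)
    return fields["rd"] if reg_dst else fields["rt"]


add_word = 0x01098020   # add $s0, $t0, $t1
lw_word = 0x8D300014    # lw $s0, 20($t1)
print(dest_register(add_word, reg_dst=1))   # register number taken from rd
print(dest_register(lw_word, reg_dst=0))    # register number taken from rt
```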

Control signal effects

Signal     When deasserted                                       When asserted
PCSrc      PC is incremented by 4                                PC set to branch target value
MemRead    None                                                  Enables memory read output
MemWrite   None                                                  Enables memory write
MemtoReg   Value fed to destination register comes from ALU      Value fed to destination register comes from memory

• PCSrc is asserted when
  – Instruction is a branch and
  – ALU Zero result bit is asserted
• We will introduce a Branch control line

Control line settings

Instruction  RegDst  ALUSrc  MemtoReg  RegWrite
R-format     1       0       0         1
lw           0       1       1         1
sw           X       1       X         0
beq          X       0       X         0

Control line settings

Instruction  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     0        0         0       1       0
lw           1        0         0       0       0
sw           0        1         0       0       0
beq          0        0         1       0       1
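The two control line tables can be captured as a lookup table. The Python sketch below is one possible encoding; the dictionary name `CONTROL`, the use of `None` for don't care entries, and the helper `pc_src` are assumptions of this example.

```python
# One row per instruction class; None stands for a don't care (X) entry.
CONTROL = {
    "R-format": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                     MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    "lw":       dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                     MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    "sw":       dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                     MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    "beq":      dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                     MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}


def pc_src(instruction: str, alu_zero: int) -> bool:
    """PCSrc is asserted when the instruction is a branch and the ALU Zero bit is set."""
    return bool(CONTROL[instruction]["Branch"] and alu_zero)


print(pc_src("beq", alu_zero=1))   # branch taken
print(pc_src("lw", alu_zero=1))    # not a branch: PC is incremented by 4
```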

Active datapaths for a R instruction

Active datapaths for a load instruction

Active datapaths for a beq instruction

The "weird" jump instruction

• Uses J format
  – Single 26-bit operand
  – Implements an unconditional jump
• New value of PC is obtained as follows
  – Bits 1:0 are zero (address is multiple of 4)
  – Bits 27:2 come from jump operand
  – Bits 31:28 come from PC+4
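The construction of the new PC value can be sketched as follows, assuming the standard MIPS J format in which the 26-bit operand fills bits 27:2 and the four high-order bits are taken from PC+4. The function name `jump_target` is hypothetical.

```python
def jump_target(pc: int, operand26: int) -> int:
    """Build the jump address from PC+4 and the 26-bit jump operand."""
    upper = (pc + 4) & 0xF0000000             # four high-order bits come from PC+4
    middle = (operand26 & 0x03FFFFFF) << 2    # operand shifted left 2: low two bits are zero
    return upper | middle


print(hex(jump_target(0x40001000, 0x100)))
```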

Implementing the jump instruction

Limitations of single-cycle design

• If we want all instructions to be executed in one cycle
  – Clock cycle must be long enough to accommodate instruction taking the most time
    • Floating-point multiply or divide
• Does not work for CPUs that have a rich instruction set

PIPELINING

An analogy (I)

• Washing your clothes
  – Four steps:
    1. Putting in the washer
    2. Putting in the dryer
    3. Folding/ironing
    4. Putting them away

An analogy (II)

• Most people
  – Start second wash load as soon as first wash load is in dryer
  – Put second wash load in dryer and start a third wash load while they are folding/ironing the first wash load

Purely sequential approach

Time 6 pm 6:30 7pm 7:30 8pm 8:30 9pm 9:30

Wash Dry Fold Store

Smart approach

Time 6 pm 6:30 7pm 7:30 8pm 8:30 9pm 9:30

Wash Dry Fold Store

Solution assumes that a housemate puts folded/ironed clothes away for us

Main advantage

• Can do much more in much less time

Limitation

• Slowed down by time taken by longest step
  – Could be washing/drying/ironing

Instruction steps (I)

• Good candidates for pipelining steps
  1. Fetch instruction from memory
  2. Decode instruction
  3. Read registers
  4. Execute register to register operation or calculate address
  5. Access operand in memory
  6. Write results into a register

Instruction steps (II)

• Since MIPS instruction set has fixed fields, we can combine steps 2 and 3
  1. Fetch instruction from memory
  2. Read registers while decoding instruction
  3. Execute register to register operation or calculate address
  4. Access operand in memory
  5. Write results into a register

Sample step timings

Instruction class      Instruction fetch   Register read   ALU operation   Data access   Register write   Total time
Load word (lw)         200 ps              100 ps          200 ps          200 ps        100 ps           800 ps
Store word (sw)        200 ps              100 ps          200 ps          200 ps        --               700 ps
R format instruction   200 ps              100 ps          200 ps          --            100 ps           600 ps
Branch (beq)           200 ps              100 ps          200 ps          --            --               500 ps
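These timings determine the clock cycle of each design. The Python sketch below works through the arithmetic under two assumptions that are not stated in the table: the single-cycle clock must fit the slowest instruction, and an ideal five-stage pipeline (no stalls) is clocked at the duration of its slowest step. All names are hypothetical.

```python
# Step durations in picoseconds: fetch, register read, ALU, data access, register write.
STEP_PS = {
    "lw":       [200, 100, 200, 200, 100],
    "sw":       [200, 100, 200, 200, 0],
    "R-format": [200, 100, 200, 0, 100],
    "beq":      [200, 100, 200, 0, 0],
}

# Single-cycle design: the cycle must accommodate the slowest instruction.
single_cycle_clock = max(sum(steps) for steps in STEP_PS.values())

# Pipelined design: the cycle must accommodate the slowest step.
pipelined_clock = max(max(steps) for steps in STEP_PS.values())


def single_cycle_time(n: int) -> int:
    """Time in ps to run n instructions, one per (long) cycle."""
    return n * single_cycle_clock


def pipelined_time(n: int, stages: int = 5) -> int:
    """Time in ps to run n instructions in an ideal pipeline without stalls."""
    return (n + stages - 1) * pipelined_clock


print(single_cycle_clock, pipelined_clock)
print(single_cycle_time(3), pipelined_time(3))
```

For long instruction sequences the ratio of the two times approaches the ratio of the two clock cycles.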

Step 1: Fetch and decode

Step 2: Read registers

Step 3: Use the ALU

Step 4: Access operand in memory

Step 5: Store result in register

Observations

• Most R format instructions operate on three registers and skip step 4

• Same for most I format instructions with an immediate operand

• Store operations skip step 5
• Load register instructions go through all five steps

Pipelining limitations

• Some instructions that skip a step will still have to wait until preceding instruction is done.

• Hazards:
  – An instruction cannot proceed because
    • Hardware cannot support the combination of instructions (structural hazards)
    • Data are not ready (data hazards)
    • Control/branch hazards

Structural hazards

• Combinations of instructions that prevent pipelining

A bad MIPS instruction (I)

• Recall that the IBM instruction set had instructions that add to a register the contents of a memory location
  – RX format

A bad MIPS instruction (II)

• We could think of a MIPS instruction with three register operands
    ADDX r1, r2, r3
  adding to r1 the contents of the word at address contents of r2 + contents of r3
• We would have r1 = r1 + Mem[r2+r3]

A bad MIPS instruction (III)

• It would be great for accessing arrays
  – r2 would have starting address of array (fixed value)
  – r3 would contain the array index multiplied by the word size (incremented after each step)

A bad MIPS instruction (IV)

• Adding this instruction would be a very bad idea

– Why?

Answer

• Instruction would require two steps using the ALU
  – Adding r2 and r3 to compute the address of the memory operand (step 4)
  – Adding the memory operand to r1
• New step would introduce a structural hazard by preventing any other instruction from accessing the ALU

My comment

• Careful design of the MIPS CPU and instruction set should be noted
  – Not true for older instruction sets
    • IBM 360, DEC VAX, …
  – Not true for x86 instruction sets
    • CPU is designed to be compatible with an existing instruction set

Designing instruction sets for pipelining (I)

• All instructions should have the same length
  – Can fetch future instructions before the current one is decoded
• Have few instruction formats with register fields always in the same position
  – Can combine instruction decode and register read steps

Designing instruction sets for pipelining (II)

• Memory operands should only appear in load and store instructions
  – No instruction can use the ALU twice!
• Operands must be properly aligned in memory
  – Can always access them in a single memory access

Data hazards (I)

• Assume we have
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
  or
    s0 = t0 + t1
    t2 = s0 - t3
• Need result of add before proceeding with sub instruction

Detail of steps

Cycle   1    2      3      4      5      6
add     IF   ID/RR  ALU    RW
sub          IF     stall  stall  ID/RR  ALU

• Second instruction must wait until first instruction has updated $s0 in cycle 4 before reading its value in cycle 5

Data hazards (II)

• New value of $s0 computed by the add instruction is not stored in $s0 until its step 5 has completed

• New instruction must wait until add instruction has performed its step 5 before performing its step

Data hazards (III)

[Figure: pipeline steps of the add and sub instructions]

Data hazards (IV)

• We lose two cycles during which nothing can be done

• Cannot trust compiler to remove all data hazards
• Observe that new value of $s0 becomes available at the end of step 3 of add instruction
  – Add special circuitry to provide this value at the end of step 2 of sub instruction
    • Forwarding or bypassing

After forwarding

[Figure: pipeline steps of the add and sub instructions with forwarding]

Detail of steps

Cycle   1    2      3      4      5      6
add     IF   ID/RR  ALU    RW
sub          IF     ID/RR  ALU    RW

• Second instruction now gets updated value at the end of cycle 3 just in time to use it in cycle 4
  – No stall cycles

Limitations (I)

• Forwarding worked very well because output of step 3 of add was forwarded to be input of step 3 of sub
• Would not work as well if output of an instruction step is needed as input of an instruction step of next instruction
  – Will still have one or more pipeline stalls (bubbles)

Limitations (II)

• Assume we have
    lw $s0, 20($t1)
    sub $t2, $s0, $t3
  or
    s0 = Mem[t1+20]
    t2 = s0 - t3
• Need new value of s0 before proceeding with sub instruction

Limitations (III)

[Figure: pipeline steps of the two instructions]

Detail of steps

Cycle   1    2      3      4      5      6
lw      IF   ID/RR  ALU    MEM    RW
sub          IF     ID/RR  stall  ALU    RW

• Even with forwarding second instruction must wait until completion of memory access of first instruction in cycle 4 before performing its ALU step in cycle 5 – One stall cycle
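The stall counts in the three "Detail of steps" tables follow from one simple rule: the consumer step cannot start before the cycle after the one in which the needed value becomes available. A hedged Python sketch of that rule, with a hypothetical function name and cycle numbers read off the tables:

```python
def stall_cycles(value_ready_end_of_cycle: int, consumer_nominal_cycle: int) -> int:
    """Number of stall cycles before the consumer step can run.

    value_ready_end_of_cycle: cycle at whose end the producer's value is available.
    consumer_nominal_cycle:   cycle in which the consumer step would run without stalls.
    """
    return max(0, value_ready_end_of_cycle + 1 - consumer_nominal_cycle)


# add then sub, no forwarding: add writes $s0 in cycle 4,
# sub would normally read its registers in cycle 3.
print(stall_cycles(4, 3))

# add then sub, with forwarding: ALU result ready at end of cycle 3,
# sub uses the ALU in cycle 4.
print(stall_cycles(3, 4))

# lw then sub, with forwarding: memory access completes in cycle 4,
# sub would normally use the ALU in cycle 4.
print(stall_cycles(4, 4))
```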

A last word

• In many architectures, the floating point unit is a significant source of structural hazards
  – Less well adapted to pipelining
• The MIPS architecture assumes that we have separate memories for instructions and data
  – Having a single memory for both would result in many more hazards

Control / jump hazards

• Happen whenever we have a conditional jump
• Consider the instructions
    add $4, $5, $6
    beq $1, $2, 40
    or $7, $8, $9
• Need result of conditional branch (beq) before deciding whether to execute next instruction (or)

Control hazards (II)

Pipelined datapath

Datapaths for pipelined organization

• Define five steps
  1. Fetch instruction from memory (IF)
  2. Instruction decode and register reads (ID)
  3. Execute arithmetic-logic operation on ALU (EX)
  4. Access operand in memory (MEM)
  5. Write back results into a register (WB)

Datapaths for pipelined organization

• Insert registers to save outputs of each step before they get updated by the next step
  1. IF/ID registers
  2. ID/EX registers
  3. EX/MEM registers
  4. MEM/WB registers

A first try

[Figure: datapath with the four new pipeline registers]

Comments

• This first try is not correct
  – Load instruction will not be implemented correctly
    • Address of destination register will be lost as soon as a new instruction is fetched
    • Must save it at each step
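The problem can be illustrated with a toy Python simulation that only tracks destination register numbers. It is a sketch under simplifying assumptions (one instruction fetched per cycle, no stalls); the function name `write_back_trace` and the list-based pipeline registers are hypothetical.

```python
def write_back_trace(dest_regs):
    """For each write-back, return (destination carried along, destination in IF/ID).

    The first value is what a correct design uses (saved at each step);
    the second is what the first try would use (the newest fetched instruction).
    """
    latches = [None, None, None, None]        # IF/ID, ID/EX, EX/MEM, MEM/WB
    trace = []
    for fetched in list(dest_regs) + [None] * 4:   # extra cycles drain the pipeline
        # At the clock edge each pipeline register takes the value of the previous one.
        latches = [fetched] + latches[:3]
        if latches[3] is not None:            # an instruction reaches write-back
            trace.append((latches[3], latches[0]))
    return trace


# Destination registers of five consecutive instructions.
for carried, in_if_id in write_back_trace([16, 10, 7, 5, 3]):
    print(f"correct destination: {carried}, first try would use: {in_if_id}")
```

In the first write-back, the instruction targeting register 16 would wrongly update register 5, the destination of the instruction fetched three cycles later.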

The almost correct datapaths

Register address follows instruction

The almost correct datapaths
